A prototyping system for hardware distributed objects with diversity of programming languages: Design and preliminary evaluation
Takeshi Ohkawa, T. Yokota, K. Ootsu
Pub Date: 2013-12-01 | DOI: 10.1109/FPT.2013.6718418
A prototyping system for hardware distributed objects using a hardwired ORB (Object Request Broker) protocol processing engine was implemented on a Xilinx Zynq-7000 platform, through which a circuit IP on an FPGA can be operated from application software on a Linux/ARM processor via an object-oriented method call. The proposed framework improves the controllability and design productivity of FPGA-based systems. A developer can define an object-oriented interface for a circuit IP in an FPGA and implement the control-sequence part using the JavaRock Java-to-HDL synthesizer. Because the engine conforms to the standard CORBA (Common Object Request Broker Architecture) protocol, circuit IPs in an FPGA can be handled through an object-oriented interface from a variety of programming languages, such as C++, Java, and Python. The round-trip delay of the prototype system was measured over a Xillybus FIFO interface channel.
Partially reconfigurable flux calculation scheme in advection term computation
M. S. A. Talip, Takayuki Akamine, Mao Hatto, Yasunori Osana, N. Fujita, H. Amano
Pub Date: 2013-12-01 | DOI: 10.1109/FPT.2013.6718393
Fast Aerodynamics Routines (FaSTAR) is one of the most recent fluid dynamics software packages. FaSTAR is hard to execute on parallel machines because of its irregular and unpredictable data structures. Exploiting the advantages of reconfigurable hardware to make up for the shortcomings of existing high-performance computers has gradually become a solution. However, a single FPGA is not enough for the FaSTAR package because the whole module is very large. Instead of using many FPGAs, the partial reconfiguration capability of recent FPGAs is explored for this application. The advection-term computation module in FaSTAR is chosen as the target subroutine. We propose a reconfigurable flux calculation scheme that uses partial reconfiguration to save hardware resources and fit the design into a single FPGA. We developed a flux computation module in which five flux calculation schemes are implemented as reconfigurable modules. This implementation saves up to 62.75% of hardware resources and improves configuration speed by a factor of 6.28. Performance evaluation also shows a 2.65× speedup over an Intel Core 2 Duo at 2.4 GHz.
TROJANUS: An ultra-lightweight side-channel leakage generator for FPGAs
Sebastian Kutzner, A. Poschmann, Marc Stöttinger
Pub Date: 2013-12-01 | DOI: 10.1109/FPT.2013.6718347
In this article we present a new side-channel building block for FPGAs which, akin to the old Roman god Janus, has two contradictory faces: as a watermarking tool, it allows IP cores to be uniquely identified by adding a single slice to the design; as a Trojan side-channel (TSC), it can potentially leak an entire encryption key within a single trace, without knowledge of either the plaintext or the ciphertext. We practically verify TROJANUS' feasibility by embedding it as a TSC into a lightweight FPGA implementation of PRESENT. In addition, we investigate the leakage behavior of FPGAs in more detail and present a new pre-processing technique which can potentially increase the correlation coefficient of DPA attacks.
Mobile GPU shader processor based on non-blocking Coarse Grained Reconfigurable Arrays architecture
Kwon-Taek Kwon, Sungjin Son, Jeongae Park, Sangoak Woo, Seokyoon Jung, Soojung Ryu
Pub Date: 2013-12-01 | DOI: 10.1109/FPT.2013.6718353
Processors based on coarse-grained reconfigurable arrays (CGRAs) provide high performance and energy efficiency as well as programmability through their ability to reconfigure the datapath connecting the ALU arrays. A CGRA-based processor executes loop kernels whose schedule must be fixed at compile time. This restriction keeps CGRAs from being efficient, particularly when accessing external memories or caches whose access times vary greatly. This makes it challenging to build a CGRA-based high-performance, energy-efficient mobile GPU, because GPU shader execution usually involves massive texture memory accesses, which consist of accesses to the texture cache and to external texture memory. In this paper, we present a non-blocking coarse-grained reconfigurable array (NBCGRA) architecture that can handle varying-latency operations efficiently, and we propose an improved CGRA-based GPU shader processor architecture built on it. A retry buffer enables threads to re-execute later, once the required memory access completes. With a non-blocking texture cache, the shader core can execute without stalls even in the case of cache misses. Together these components greatly improve CGRA core throughput despite longer memory access latencies. Evaluation results show that our NBCGRA-based shader processor performs efficiently despite extreme variation in texture cache access latencies and reduces shader execution cycles by up to 68% with minimal hardware cost overhead.
Direct virtual memory access from FPGA for high-productivity heterogeneous computing
Ho-Cheung Ng, Yuk-Ming Choi, Hayden Kwok-Hay So
Pub Date: 2013-12-01 | DOI: 10.1109/FPT.2013.6718414
Heterogeneous computing utilizing both CPU and FPGA requires access to data in main memory from both devices. While a typical system relies on software executing on the CPU to orchestrate all data movement between the FPGA and main memory, our demo presents a complementary FPGA-centric approach that allows gateware to directly access the virtual memory space as part of the executing process, without involving the CPU. A caching address translation buffer was implemented alongside the user FPGA gateware to provide runtime mapping between virtual and physical memory addresses. The system was implemented on a commercial off-the-shelf FPGA add-on card to demonstrate the viability of such an approach in low-cost systems. Experiments demonstrated a reasonable performance improvement compared with a typical software-centric implementation, while the number of context switches between FPGA and CPU in both kernel and user mode was significantly reduced, freeing the CPU for other concurrent user tasks.
Real-time ray tracing on coarse-grained reconfigurable processor
J. D. Lee, Youngsam Shin, Won-Jong Lee, Soojung Ryu, Jeongwook Kim
Pub Date: 2013-12-01 | DOI: 10.1109/FPT.2013.6718352
Ray tracing is a 3D rendering method that generates an image by simulating the path of light. It can produce high-quality images, but it requires substantial computing power. Recent advances in ray tracing technology enable real-time ray tracing on modern desktop CPUs/GPUs. In the current mobile environment, however, it remains difficult because of the limited computing power, memory bandwidth, and flexibility of mobile GPUs. In this paper, we present a mobile ray tracing system using the Samsung Reconfigurable Processor (SRP). The SRP architecture includes a tightly coupled very long instruction word (VLIW) engine and a coarse-grained reconfigurable array (CGRA). The VLIW engine is designed for general-purpose computation, such as function invocation and branch selection, while the CGRA is specialized for the data-intensive parts of the program and can be configured dynamically. We propose an iterative, batch-based ray tracing algorithm for the SRP and optimize memory bandwidth using local memory and a data cache. Our ray tracing system is implemented on a commercial FPGA-based prototyping system. The experimental results show that our system is suitable for mobile ray tracing.
An acceleration method of short read mapping using FPGA
Y. Sogabe, T. Maruyama
Pub Date: 2013-12-01 | DOI: 10.1109/FPT.2013.6718385
The rapid development of Next Generation Sequencing (NGS) has made it possible to generate more than 100 giga base pairs per day from a single machine. The produced data are randomly fragmented DNA base-pair strings, called short reads, and millions of short reads are mapped onto reference genomes, which are complete genetic sequences, to reconstruct the sequence of the sample DNA. This short read mapping is becoming the bottleneck of NGS systems. In this paper, we propose an FPGA system for the mapping based on a hash-index method. In our system, short reads are divided into seeds, fixed-length substrings used for the mapping, and the seeds are sorted into buckets. The seeds in each bucket are then compared in parallel with their candidate locations. With this approach, many seeds can be compared with their candidate locations in a massively parallel manner, and processing speed is improved by reducing the number of random accesses to the DRAM banks that store the candidate locations. Furthermore, substitutions of nucleotides in a seed can be allowed in this parallel comparison, which makes it possible to achieve higher matching rates than previous works.
A defect-tolerant cluster in a mesh SRAM-based FPGA
Arwa Ben Dhia, S. Rehman, Adrien Blanchardon, L. Naviner, M. Benabdenbi, R. Chotin-Avot, Emna Amouri, H. Mehrez, Z. Marrakchi
Pub Date: 2013-12-01 | DOI: 10.1109/FPT.2013.6718407
In this paper, we propose the implementation of multiple defect-tolerant techniques on an SRAM-based FPGA. These techniques include redundancy in both the logic block and the intra-cluster interconnect. In the logic block, redundancy is implemented at the multiplexer level; its efficiency is analyzed by injecting a single defect at the output of a multiplexer, considering all possible locations and input combinations. At the interconnect level, fine-grained redundancy is introduced, which not only bypasses defects but also increases routability. Taking advantage of the sparse intra-cluster interconnect structures, routability is further improved by an efficient distribution of feedback paths, allowing more flexibility in the connections among logic blocks. Emulation results show significant improvements of about 15% and 34% in the robustness of the logic block and the intra-cluster interconnect, respectively. Furthermore, the impact of these hardening schemes on the testability of the FPGA cluster for manufacturing defects is investigated in terms of the maximum achievable fault coverage and the respective cost.
Debugging processors with advanced features by reprogramming LUTs on FPGA
Satoshi Jo, A. M. Gharehbaghi, Takeshi Matsumoto, M. Fujita
Pub Date: 2013-12-01 | DOI: 10.1109/FPT.2013.6718329
In this paper, we propose an automated method for debugging and rectifying logical bugs in processors implemented on FPGAs. Our method preserves the current circuit topology, debugging and rectifying bugs by changing only the contents of LUTs, without any modification to the wiring. As a result, correcting the bugs does not require re-synthesis, which can be very time-consuming for complex processors due to possible timing-closure problems. Because the topology of the circuit is preserved, correcting the bugs does not affect the circuit's timing. In the design phase, we may add extra LUTs, or extra inputs to LUTs, in the original circuit so that they can be used in the debugging and rectification phase. After a bug is found, we first try to identify the candidate signals as well as the changes required to correct their behavior. This is achieved using symbolic simulation and equivalence checking between an instruction-set architecture model of the processor and its erroneous micro-architecture-level model. Then, we try to map the corrected functionality onto the existing LUT topology. This is realized by a novel method that formulates the problem as a QBF (Quantified Boolean Formula) problem and solves it by repeatedly and incrementally applying ordinary SAT solvers instead of a QBF solver, using ideas from the CEGAR (Counterexample-Guided Abstraction Refinement) paradigm. We show the effectiveness and efficiency of our method by correcting bugs in two complex out-of-order superscalar processors with a timing-error recovery mechanism.
High-level synthesis of dynamic data structures: A case study using Vivado HLS
F. Winterstein, Samuel Bayliss, G. Constantinides
Pub Date: 2013-12-01 | DOI: 10.1109/FPT.2013.6718388
High-level synthesis promises a significant shortening of the FPGA design cycle compared with design entry using register transfer level (RTL) languages. Recent evaluations report that C-to-RTL flows can produce results with a quality close to hand-crafted designs [1]. Algorithms that use dynamic, pointer-based data structures, which are common in software, remain difficult to implement well. In this paper, we describe a comparative case study using Xilinx Vivado HLS as an exemplary state-of-the-art high-level synthesis tool. Our test cases are two alternative algorithms for the same compute-intensive machine learning technique (clustering) with significantly different computational properties. We compare a data-flow-centric implementation to a recursive tree-traversal implementation that incorporates complex data-dependent control flow and makes use of pointer-linked data structures and dynamic memory allocation. The outcome of this case study is twofold. We confirm similar performance between the hand-written and automatically generated RTL designs for the first test case. The second case reveals a degradation in latency by a factor greater than 30× if the source code is not altered prior to high-level synthesis. We identify the reasons for this shortcoming and present code transformations that narrow the performance gap to a factor of four. We generalise our source-to-source transformations, whose automation motivates future research directions for improving high-level synthesis of dynamic data structures.