Pub Date: 2013-12-01. DOI: 10.1109/FPT.2013.6718328
S. Chin, J. Anderson
This paper presents a case for a hybrid configurable logic block that contains a mixture of LUTs and hardened multiplexers, with the goal of higher logic density and area reduction. Technology mapping optimizations, called MuxMap, that target the proposed architecture are implemented using a modified version of the mapper in the ABC logic synthesis tool. VPR is used to model the new hybrid configurable logic block and verify the post-place-and-route implementation. Multiple hybrid configurable logic block architectures with varying MUX:LUT ratios are evaluated across three benchmark suites with both the Quartus II and Odin-II front-end RTL synthesis tools. Experimentally, we show that the architecture saves ~4% area post place and route without any mapper optimizations, and that the MuxMap optimizations in ABC yield ~6% area reduction post place and route, while maintaining mapping depth, overall configurable logic block count, and routing demand.
Title: A case for hardened multiplexers in FPGAs. Published in: 2013 International Conference on Field-Programmable Technology (FPT).
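To illustrate why a hardened multiplexer can be denser than a LUT implementing the same function, the sketch below enumerates a 4:1 MUX as a truth table, the way a LUT would store it. The 4:1 MUX size, the 6-input LUT, and the names `mux4` and `lut_truth_table` are assumptions made for illustration only; the abstract does not state the MUX or LUT sizes used in the paper.

```python
from itertools import product

def mux4(d0, d1, d2, d3, s0, s1):
    """4:1 multiplexer: select lines (s1, s0) pick one of four data inputs."""
    return (d0, d1, d2, d3)[(s1 << 1) | s0]

def lut_truth_table(func, num_inputs):
    """Enumerate func over every input combination, as a K-input LUT stores it."""
    return [func(*bits) for bits in product((0, 1), repeat=num_inputs)]

# A 4:1 MUX has 4 data + 2 select inputs, so it fills an (assumed) 6-input LUT.
table = lut_truth_table(mux4, 6)
print(len(table))  # 64 configuration bits for a single MUX function
```

Under these assumptions, a 6-input LUT spends 64 truth-table bits on one fixed function, while a hardened MUX needs no truth-table storage for it. This is only the intuition behind the density argument, not a reproduction of the paper's area model.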
Pub Date: 2013-12-01. DOI: 10.1109/FPT.2013.6718393
M. S. A. Talip, Takayuki Akamine, Mao Hatto, Yasunori Osana, N. Fujita, H. Amano
Fast Aerodynamics Routines (FaSTAR) is one of the most recent fluid dynamics software packages. FaSTAR is hard to execute on parallel machines because of its irregular and unpredictable data structure. Exploiting reconfigurable hardware to make up for the inadequacy of existing high-performance computers has gradually become a solution. However, a single FPGA is not enough for the FaSTAR package because the whole module is very large. Instead of using many FPGAs, the partially reconfigurable hardware available in recent FPGAs is explored for this application. The advection term computation module in FaSTAR is chosen as the target subroutine. We propose a reconfigurable flux calculation scheme that uses partial reconfiguration to save hardware resources so that the design fits in a single FPGA. We developed a flux computation module, and five flux calculation schemes are implemented as reconfigurable modules. This implementation saves up to 62.75% of the resources and speeds up configuration by 6.28 times. Performance evaluation also shows a 2.65 times speedup compared to an Intel Core 2 Duo at 2.4 GHz.
Title: Partially reconfigurable flux calculation scheme in advection term computation
Pub Date: 2013-12-01. DOI: 10.1109/FPT.2013.6718353
Kwon-Taek Kwon, Sungjin Son, Jeongae Park, Jeongae Park, Sangoak Woo, Seokyoon Jung, Soojung Ryu
Coarse-grained reconfigurable array (CGRA) based processors provide high performance and energy efficiency as well as programmability, through the ability to reconfigure the datapath connecting the ALU arrays. A CGRA-based processor executes loop kernels whose schedule must be fixed at compile time. This restriction makes CGRAs inefficient, particularly when accessing external memories or caches whose access time varies greatly. It is therefore challenging to build a CGRA-based high-performance, energy-efficient mobile GPU, because GPU shader execution usually involves massive texture memory accesses, which consist of accesses to the texture cache and external texture memory. In this paper, we present a Non-blocking Coarse Grained Reconfigurable Arrays (NBCGRA) architecture which can handle varying-latency operations efficiently. We also propose an improved CGRA-based GPU shader processor architecture based on it. A retry buffer enables threads to re-execute later, when the required memory access completes. With a non-blocking texture cache, the shader core can execute without stalls even in the case of cache misses. All of these components help to improve CGRA core throughput greatly despite longer memory access latencies. Evaluation results show that our NBCGRA-based shader processor performs efficiently despite extreme variation in texture cache access latencies and reduces shader execution cycles by up to 68% with minimal hardware cost overhead.
Title: Mobile GPU shader processor based on non-blocking Coarse Grained Reconfigurable Arrays architecture
Pub Date: 2013-12-01. DOI: 10.1109/FPT.2013.6718347
Sebastian Kutzner, A. Poschmann, Marc Stöttinger
In this article we present TROJANUS, a new side-channel building block for FPGAs which, akin to the ancient Roman god Janus, has two contradictory faces: as a watermarking tool, it allows IP cores to be uniquely identified by adding a single slice to the design; as a Trojan Side-Channel (TSC), it can potentially leak an entire encryption key within only one trace and without knowledge of either the plaintext or the ciphertext. We practically verify TROJANUS' feasibility by embedding it as a TSC into a lightweight FPGA implementation of PRESENT. In addition, we investigate the leakage behavior of FPGAs in more detail and present a new pre-processing technique, which can potentially increase the correlation coefficient of DPA attacks.
Title: TROJANUS: An ultra-lightweight side-channel leakage generator for FPGAs
Pub Date: 2013-12-01. DOI: 10.1109/FPT.2013.6718414
Ho-Cheung Ng, Yuk-Ming Choi, Hayden Kwok-Hay So
Heterogeneous computing utilizing both CPU and FPGA requires access to data in the main memory from both devices. While a typical system relies on software executing on the CPU to orchestrate all data movements between the FPGA and the main memory, our demo presents a complementary FPGA-centric approach that allows gateware to directly access the virtual memory space as part of the executing process without involving the CPU. A caching address translation buffer was implemented alongside the user FPGA gateware to provide runtime mapping between virtual and physical memory addresses. The system was implemented on a commercial off-the-shelf FPGA add-on card to demonstrate the viability of such an approach in low-cost systems. Experiments demonstrated a reasonable performance improvement compared to a typical software-centric implementation, while the number of context switches between FPGA and CPU in both kernel and user mode was significantly reduced, freeing the CPU for other concurrent user tasks.
Title: Direct virtual memory access from FPGA for high-productivity heterogeneous computing
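The caching address translation buffer described in this abstract can be modeled in a few lines of software. The sketch below is a behavioral illustration only, not the authors' gateware: the 4 KiB page size, the LRU replacement policy, the dictionary standing in for the OS page table, and the class name `TranslationBuffer` are all assumptions.

```python
from collections import OrderedDict

PAGE_SHIFT = 12  # assumption: 4 KiB pages

class TranslationBuffer:
    """Small LRU cache of virtual-page -> physical-frame mappings."""

    def __init__(self, page_table, capacity=4):
        self.page_table = page_table  # hypothetical stand-in for the OS page table
        self.capacity = capacity
        self.entries = OrderedDict()  # ordered from least to most recently used
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)  # mark as most recently used
        else:
            self.misses += 1
            # A KeyError here models an unmapped page (page fault).
            self.entries[vpn] = self.page_table[vpn]
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
        return (self.entries[vpn] << PAGE_SHIFT) | offset

tb = TranslationBuffer({0x10: 0x2A, 0x11: 0x07}, capacity=1)
print(hex(tb.translate(0x10123)))  # 0x2a123 (miss, then cached)
print(hex(tb.translate(0x10456)))  # 0x2a456 (hit)
```

Only a miss has to consult the page table; hits are resolved locally. In the system the abstract describes, resolving translations next to the user gateware is what lets the FPGA access virtual memory without involving the CPU on every access.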
Pub Date: 2013-12-01. DOI: 10.1109/FPT.2013.6718352
J. D. Lee, Youngsam Shin, Won-Jong Lee, Soojung Ryu, Jeongwook Kim
Ray tracing is a 3D rendering method for generating an image by simulating the path of light. It can generate high-quality images, but it requires great computing power. Recent advances in ray tracing technology enable real-time ray tracing on modern desktop CPUs/GPUs. In the current mobile environment, however, it is difficult because of inadequate computing power, memory bandwidth, and flexibility in mobile GPUs. In this paper, we present a mobile ray tracing system using the Samsung Reconfigurable Processor (SRP). The SRP architecture includes a tightly coupled very long instruction word (VLIW) engine and a coarse-grained reconfigurable array (CGRA). The VLIW engine is designed for general-purpose computations, such as function invocation and branch selection, and the coarse-grained reconfigurable array is specialized for the data-intensive parts of the program and can be configured dynamically. We propose an iterative batch-based ray tracing algorithm for SRP, and optimize memory bandwidth with local memory and a data cache. Our ray tracing system is implemented on a commercial FPGA-based prototyping system. The experimental results show that our system is suitable for mobile ray tracing.
Title: Real-time ray tracing on coarse-grained reconfigurable processor
Pub Date: 2013-12-01. DOI: 10.1109/FPT.2013.6718385
Y. Sogabe, T. Maruyama
The rapid development of Next Generation Sequencing (NGS) has made it possible to generate more than 100G base pairs per day from one machine. The produced data are randomly fragmented DNA base pair strings, called short reads, and millions of short reads are mapped onto the reference genomes, which are complete genetic sequences, to reconstruct the sequence of the sample DNA. This short read mapping is becoming the bottleneck of NGS systems. In this paper, we propose an FPGA system for the mapping based on a hash-index method. In our system, short reads are divided into seeds, which are fixed-length substrings used for the mapping, and the seeds are sorted using buckets. Then, the seeds in each bucket are compared in parallel with the candidate locations. With this approach, many seeds can be compared with their candidate locations in a massively parallel manner, and the processing speed can be improved by reducing the number of random accesses to the DRAM banks which store the candidate locations. Furthermore, substitutions of the nucleotides in a seed can be allowed in this parallel comparison. This makes it possible to achieve higher matching rates than previous work.
Title: An acceleration method of short read mapping using FPGA
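The hash-index idea in this abstract (split each read into fixed-length seeds, look up candidate locations, then compare against the reference) can be sketched in software as follows. This is a sequential illustration under several assumptions, not the authors' FPGA design: the function names are hypothetical, the bucket sort and DRAM banking are omitted, seeds are looked up exactly, and substitutions are tolerated only in the final read-level comparison, whereas the abstract allows them within a seed.

```python
from collections import defaultdict

def build_index(reference, seed_len):
    """Hash index: every seed-length substring -> list of reference positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index

def split_into_seeds(read, seed_len):
    """Non-overlapping fixed-length seeds, each with its offset in the read."""
    return [(off, read[off:off + seed_len])
            for off in range(0, len(read) - seed_len + 1, seed_len)]

def map_read(read, reference, index, seed_len, max_mismatches=1):
    """Return (start, mismatches) for candidate locations within the budget."""
    candidates = set()
    for off, seed in split_into_seeds(read, seed_len):
        for pos in index.get(seed, []):
            start = pos - off  # where the whole read would begin
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    hits = []
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits

reference = "ACGTACGTTAGCCGATAGGCT"
index = build_index(reference, seed_len=4)
print(map_read("TAGCCGAT", reference, index, seed_len=4))  # [(8, 0)]
print(map_read("TAGCCGTT", reference, index, seed_len=4))  # [(8, 1)]
```

In the second call one seed still matches exactly, so the true location remains a candidate and the single substitution is absorbed by the mismatch budget. The candidate comparisons are independent of each other, which is what makes this step a natural fit for parallel hardware.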
Pub Date: 2013-12-01. DOI: 10.1109/FPT.2013.6718322
Shanker Shreejith, Suhaib A. Fahmy, M. Lukasiewycz
Automotive systems comprise a high number of networked safety-critical functions. Any design changes or addition of new functionality must be rigorously tested to ensure that no performance or safety issues are introduced, and this consumes a significant amount of time. Validation should be conducted using a faithful representation of the system, and so typically, a full subsystem is built for validation. We present a scalable scheme for emulating a complete cluster of automotive embedded compute units on an FPGA, with accelerated network communication using custom physical level interfaces. With these interfaces, we can achieve acceleration of system emulation by 8× or more, with a systematic way of exploring real-world issues like jitter, network delays, and data corruption, among others. By using the same communication infrastructure as in a real deployed system, this validation is closer to the requirements of standards compliance. This approach also enables hardware-in-the-loop (HIL) validation, allowing rapid prototyping of distributed functions, including changes in network topology and parameters, and modification of time-triggered schedules without physical hardware modification. We present an implementation of this framework on the Xilinx ML605 evaluation board that integrates six FlexRay automotive functions to demonstrate the potential of the framework.
Title: Accelerating validation of time-triggered automotive systems on FPGAs
Pub Date: 2013-12-01. DOI: 10.1109/FPT.2013.6718329
Satoshi Jo, A. M. Gharehbaghi, Takeshi Matsumoto, M. Fujita
In this paper, we propose an automated method for debugging and rectification of logical bugs in processors that are implemented on FPGAs. Our method is based on preserving the current circuit topology, and debugging and rectifying bugs by only changing the contents of LUTs, without any modification to the wiring. As a result, correcting the bugs does not require re-synthesis, which can be very time-consuming for complex processors due to possible timing closure problems. As the topology of the circuit is preserved, correcting the bugs does not affect the timing of the circuit. In the design phase, we may add additional LUTs or additional inputs to LUTs in the original circuit, so that we can use them in the debugging and rectification phase. After a bug is found, we first try to identify the candidate signals as well as the changes required to correct their behavior. This is achieved by using symbolic simulation and equivalence checking between an instruction-set architecture model of the processor and its erroneous model at the micro-architecture level. Then, we try to map the corrected functionality into the existing LUT topology. This is realized by a novel method that formulates the problem as a QBF (Quantified Boolean Formula) problem and solves it by repeatedly and incrementally applying normal SAT solvers instead of QBF solvers, utilizing ideas from the CEGAR (Counterexample-Guided Abstraction Refinement) paradigm. We show the effectiveness as well as the efficiency of our method by correcting bugs in two complex out-of-order superscalar processors with a timing error recovery mechanism.
Title: Debugging processors with advanced features by reprogramming LUTs on FPGA
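The rectification step in this abstract asks whether there exist LUT contents such that, for all inputs, the fixed-topology circuit matches the specification, and answers it with a CEGAR-style loop of two alternating queries. The sketch below shows that loop on a toy problem. Everything in it is an assumption for illustration: the one-LUT `circuit`, the `spec`, and all function names are hypothetical, and brute-force enumeration stands in for the SAT solver calls the paper uses.

```python
from itertools import product

def circuit(cfg, x):
    """Toy fixed topology: a 2-input LUT on x[0], x[1], ANDed with x[2]."""
    lut_out = cfg[(x[0] << 1) | x[1]]  # cfg holds the 4 LUT configuration bits
    return lut_out & x[2]

def spec(x):
    """Intended behavior that the rectified circuit must reproduce."""
    return (x[0] ^ x[1]) & x[2]

def find_candidate(counterexamples, num_cfg_bits):
    """Stand-in for a SAT call: LUT contents correct on all stored inputs."""
    for cfg in product((0, 1), repeat=num_cfg_bits):
        if all(circuit(cfg, x) == spec(x) for x in counterexamples):
            return cfg
    return None

def find_counterexample(cfg, num_inputs):
    """Stand-in for a SAT call: an input where circuit and spec disagree."""
    for x in product((0, 1), repeat=num_inputs):
        if circuit(cfg, x) != spec(x):
            return x
    return None

def cegar_rectify(num_cfg_bits=4, num_inputs=3):
    counterexamples = []
    while True:
        cfg = find_candidate(counterexamples, num_cfg_bits)
        if cfg is None:
            return None, counterexamples  # no LUT contents fit this topology
        cex = find_counterexample(cfg, num_inputs)
        if cex is None:
            return cfg, counterexamples  # cfg is correct for every input
        counterexamples.append(cex)  # refine: exclude cfg on the next round

cfg, seen = cegar_rectify()
print(cfg)  # (0, 1, 1, 0), the XOR truth table
```

Each round adds an input on which the current candidate fails, so that candidate can never be proposed again and the loop terminates. The appeal of the approach is that the candidate search only has to satisfy the collected counterexamples rather than the full input space.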
Pub Date: 2013-12-01. DOI: 10.1109/FPT.2013.6718407
Arwa Ben Dhia, S. Rehman, Adrien Blanchardon, L. Naviner, M. Benabdenbi, R. Chotin-Avot, Emna Amouri, H. Mehrez, Z. Marrakchi
In this paper, we propose the implementation of multiple defect-tolerant techniques on an SRAM-based FPGA. These techniques include redundancy at both the logic block and the intra-cluster interconnect. In the logic block, redundancy is implemented at the multiplexer level. Its efficiency is analyzed by injecting a single defect at the output of a multiplexer, considering all possible locations and input combinations. At the interconnect level, fine-grained redundancy is introduced, which not only bypasses defects but also increases routability. Taking advantage of the sparse intra-cluster interconnect structures, routability is further improved by efficient distribution of feedback paths, allowing more flexibility in the connections among logic blocks. Emulation results show a significant improvement of about 15% and 34% in the robustness of the logic block and the intra-cluster interconnect, respectively. Furthermore, the impact of these hardening schemes on the testability of the FPGA cluster for manufacturing defects is also investigated in terms of maximum achievable fault coverage and the respective cost.
Title: A defect-tolerant cluster in a mesh SRAM-based FPGA