Improving Fault Tolerance for FPGA SoCs Through Post Radiation Design Analysis
A. E. Wilson, Nathan Baker, Ethan Campbell, Michael Wirthlin
ACM Transactions on Reconfigurable Technology and Systems (2024-07-19). DOI: 10.1145/3674841
FPGAs have been shown to operate reliably within harsh radiation environments by employing single-event upset (SEU) mitigation techniques such as configuration scrubbing, triple-modular redundancy (TMR), error correction coding, and radiation-aware implementation techniques. The effectiveness of these techniques, however, is limited for complex system-level designs whose I/O interfaces contain single-point failures. In previous work, a complex SoC system running Linux applied several of these techniques yet obtained only a 14× improvement in Mean Time to Failure (MTTF). A detailed post-radiation fault analysis found that the remaining reliability limitations were due to the DDR interface, the global clock network, and the interconnect. This paper applies a number of design-specific SEU mitigation techniques to address these limitations. The changes include triplicating the global clock, optimizing the placement of the reduction output voters and input flip-flops, and employing a mapping technique called “striping”. Applying these techniques improved the MTTF of the mitigated design by a factor of 1.54×, yielding a 22.8× MTTF improvement over the unmitigated design. A post-radiation fault analysis using BFAT was also performed to find the remaining design vulnerabilities.
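For illustration, the core of a TMR reduction voter is a bitwise majority function: each output bit takes the value agreed on by at least two of the three redundant domains, so a single upset domain is outvoted. The C sketch below is our own illustration of that logic, not the authors' FPGA implementation.

```c
#include <stdint.h>
#include <assert.h>

/* Bitwise majority vote over three redundant copies of a word.
 * Each output bit is the value agreed on by at least two of the
 * three TMR domains, so any single upset domain is outvoted.
 * Illustrative only -- the design in the paper implements voters
 * in FPGA fabric, not software. */
static uint32_t tmr_vote(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (b & c) | (a & c);
}

int main(void)
{
    /* A single-event upset flips a bit in one domain; the vote masks it. */
    uint32_t golden = 0xCAFEF00Du;
    uint32_t upset  = golden ^ 0x00000400u; /* one flipped bit */
    assert(tmr_vote(golden, upset, golden) == golden);
    return 0;
}
```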
{"title":"Improving Fault Tolerance for FPGA SoCs Through Post Radiation Design Analysis","authors":"A. E. Wilson, Nathan Baker, Ethan Campbell, Michael Wirthlin","doi":"10.1145/3674841","DOIUrl":"https://doi.org/10.1145/3674841","url":null,"abstract":"\u0000 FPGAs have been shown to operate reliably within harsh radiation environments by employing single-event upset (SEU) mitigation techniques such as configuration scrubbing, triple-modular redundancy, error correction coding, and radiation aware implementation techniques. The effectiveness of these techniques, however, is limited when using complex system-level designs that employ complex I/O interfaces with single-point failures. In previous work, a complex SoC system running Linux applied several of these techniques only to obtain an improvement of 14\u0000 \u0000 (times)\u0000 \u0000 in Mean Time to Failure (MTTF). A detailed post-radiation fault analysis found that the limitations in reliability were due to the DDR interface, the global clock network, and interconnect. This paper applied a number of design-specific SEU mitigation techniques to address the limitations in reliability of this design. These changes include triplicating the global clock, optimizing the placement of the reduction output voters and input flip-flops, and employing a mapping technique called “striping”. The application of these techniques improved MTTF of the mitigated design by a factor of 1.54\u0000 \u0000 (times)\u0000 \u0000 and thus provides a 22.8X\u0000 \u0000 (times)\u0000 \u0000 MTTF improvement over the unmitigated design. A post-radiation fault analysis using BFAT was also performed to find the remaining design vulnerabilities.\u0000","PeriodicalId":505501,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"103 22","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141821702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SPARTA: High-Level Synthesis of Parallel Multi-Threaded Accelerators
Giovanni Gozzi, M. Fiorito, S. Curzel, Claudio Barone, Vito Giovanni Castellana, Marco Minutoli, Antonino Tumeo, Fabrizio Ferrandi
ACM Transactions on Reconfigurable Technology and Systems (2024-07-12). DOI: 10.1145/3677035
This paper presents a methodology for the Synthesis of PARallel multi-Threaded Accelerators (SPARTA) from OpenMP-annotated C/C++ specifications. SPARTA extends an open-source HLS tool to generate accelerators that tolerate the latency of irregular memory accesses through multithreading, support fine-grained memory-level parallelism through a hot-potato, deflection-based network-on-chip (NoC), support synchronization constructs, and can instantiate memory-side caches. Our approach is based on a custom OpenMP runtime library, providing flexibility and extensibility. Experimental results show high scalability when synthesizing irregular graph kernels. The accelerators generated with our approach are, on average, 2.29× faster than those produced by state-of-the-art HLS methodologies.
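For illustration, the sketch below shows the kind of OpenMP-annotated C input such a flow consumes: an irregular graph kernel whose indirect, data-dependent loads are exactly the case that SPARTA's multithreaded latency tolerance targets. The kernel itself is our own example, not taken from the paper.

```c
#include <stddef.h>

/* Sketch of an OpenMP-annotated C specification of the sort SPARTA
 * accepts: a CSR graph kernel whose neighbor offsets are unknown at
 * compile time, producing irregular, data-dependent memory accesses. */
void neighbor_sum(const int *row_ptr, const int *col_idx,
                  const float *val, float *out, int n)
{
    #pragma omp parallel for
    for (int v = 0; v < n; v++) {
        float acc = 0.0f;
        /* Indirect loads through col_idx: the irregular-access pattern
         * that multithreading hides the latency of. */
        for (int e = row_ptr[v]; e < row_ptr[v + 1]; e++)
            acc += val[col_idx[e]];
        out[v] = acc;
    }
}
```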
{"title":"SPARTA: High-Level Synthesis of Parallel Multi-Threaded Accelerators","authors":"Giovanni Gozzi, M. Fiorito, S. Curzel, Claudio Barone, Vito Giovanni Castellana, Marco Minutoli, Antonino Tumeo, Fabrizio Ferrandi","doi":"10.1145/3677035","DOIUrl":"https://doi.org/10.1145/3677035","url":null,"abstract":"This paper presents a methodology for the Synthesis of PARallel multi-Threaded Accelerators (SPARTA) from OpenMP annotated C/C++ specifications. SPARTA extends an open-source HLS tool, enabling the generation of accelerators that provide latency tolerance for irregular memory accesses through multithreading, support fine-grained memory-level parallelism through a hot-potato deflection-based network-on-chip (NoC), support synchronization constructs, and can instantiate memory-side caches. Our approach is based on a custom runtime OpenMP library, providing flexibility and extensibility. Experimental results show high scalability when synthesizing irregular graph kernels. The accelerators generated with our approach are, on average, 2.29x faster than state-of-the-art HLS methodologies.","PeriodicalId":505501,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"97 2","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141652841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SQL2FPGA: Automated Acceleration of SQL Query Processing on Modern CPU-FPGA Platforms
Alec Lu, Jahanvi Narendra Agrawal, Zhenman Fang
ACM Transactions on Reconfigurable Technology and Systems (2024-07-02). DOI: 10.1145/3674843
Today’s big data query engines are constantly under pressure to keep up with the rapidly increasing demand for faster processing of more complex workloads. In the past few years, FPGA-based database acceleration efforts have demonstrated promising performance improvements with good energy efficiency. However, few studies target the programming and design automation support needed to leverage FPGA accelerators in query processing. Most rely on the SQL query plan generated by CPU query engines and manually map that plan onto the FPGA accelerators, which is tedious and error-prone. Moreover, such CPU-oriented query plans do not consider the utilization of FPGA accelerators and can miss optimization opportunities. In this paper, we present SQL2FPGA, an FPGA accelerator-aware compiler that automatically maps SQL queries onto heterogeneous CPU-FPGA platforms. The SQL2FPGA front-end takes an optimized logical plan of a SQL query from a database query engine and transforms it into a unified operator-level intermediate representation. To generate an optimized FPGA-aware physical plan, SQL2FPGA implements a set of compiler optimization passes to 1) improve operator acceleration coverage by the FPGA, 2) eliminate redundant computation during physical execution, and 3) minimize data transfer overhead between operators on the CPU and FPGA. Furthermore, it leverages machine learning techniques to predict and identify the optimal platform, either CPU or FPGA, for the physical execution of individual query operations. Finally, SQL2FPGA generates the associated query acceleration code for heterogeneous CPU-FPGA system deployment. Compared to the widely used Apache Spark SQL framework running on the CPU, SQL2FPGA—using two AMD/Xilinx HBM-based Alveo U280 FPGA boards and Ver. 2020 AMD/Xilinx FPGA overlay designs—achieves average performance speedups of 10.1× and 13.9× across all 22 TPC-H benchmark queries at scale factors of 1 GB (SF1) and 30 GB (SF30), respectively. Evaluated on AMD/Xilinx Alveo U50 FPGA boards with Ver. 2022 AMD/Xilinx FPGA overlay designs, SQL2FPGA also achieves an average speedup of 9.6× at the SF1 scale factor.
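For illustration, the C sketch below models what a unified operator-level IR with per-operator CPU/FPGA placement might look like, together with a toy pass that counts device crossings in the spirit of the paper's transfer-minimization objective. All names and the layout are our assumptions; the paper does not publish this structure.

```c
/* Hypothetical sketch of a unified operator-level IR node such as the
 * one SQL2FPGA's front-end might produce from a logical query plan.
 * Field and enum names are our own illustration. */
typedef enum { OP_SCAN, OP_FILTER, OP_JOIN, OP_AGGREGATE, OP_SORT } op_kind_t;
typedef enum { DEV_CPU, DEV_FPGA } device_t;

typedef struct ir_op {
    op_kind_t     kind;
    device_t      placement;   /* chosen by the ML-based platform predictor */
    struct ir_op *children[2]; /* at most two inputs (e.g., join sides)     */
    int           n_children;
} ir_op_t;

/* Toy analysis in the spirit of "minimize CPU<->FPGA data transfer":
 * count edges of the operator tree whose endpoints sit on different
 * devices, since each such edge implies a host/device copy. */
static int count_crossings(const ir_op_t *op)
{
    int n = 0;
    for (int i = 0; i < op->n_children; i++) {
        n += (op->children[i]->placement != op->placement);
        n += count_crossings(op->children[i]);
    }
    return n;
}
```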
{"title":"SQL2FPGA: Automated Acceleration of SQL Query Processing on Modern CPU-FPGA Platforms","authors":"Alec Lu, Jahanvi Narendra Agrawal, Zhenman Fang","doi":"10.1145/3674843","DOIUrl":"https://doi.org/10.1145/3674843","url":null,"abstract":"Today’s big data query engines are constantly under pressure to keep up with the rapidly increasing demand for faster processing of more complex workloads. In the past few years, FPGA-based database acceleration efforts have demonstrated promising performance improvement with good energy efficiency. However, few studies target the programming and design automation support to leverage the FPGA accelerator benefits in query processing. Most of them rely on the SQL query plan generated by CPU query engines and manually map the query plan onto the FPGA accelerators, which is tedious and error-prone. Moreover, such CPU-oriented query plans do not consider the utilization of FPGA accelerators and could lose more optimization opportunities. In this paper, we present SQL2FPGA, an FPGA accelerator-aware compiler to automatically map SQL queries onto the heterogeneous CPU-FPGA platforms. Our SQL2FPGA front-end takes an optimized logical plan of a SQL query from a database query engine and transforms it into a unified operator-level intermediate representation. To generate an optimized FPGAaware physical plan, SQL2FPGA implements a set of compiler optimization passes to 1) improve operator acceleration coverage by the FPGA, 2) eliminate redundant computation during physical execution, and 3) minimize data transfer overhead between operators on the CPU and FPGA. Furthermore, it also leverages machine learning techniques to predict and identify the optimal platform, either CPU or FPGA, for the physical execution of individual query operations. Finally, SQL2FPGA generates the associated query acceleration code for heterogeneous CPU-FPGA system deployment. Compared to the widely used Apache Spark SQL framework running on the CPU, SQL2FPGA—using two AMD/Xilinx HBM-based Alveo U280 FPGA boards and Ver.2020 AMD/Xilinx FPGA overlay designs—achieves an average performance speedup of 10.1x and 13.9x across all 22 TPC-H benchmark queries in a scale factor of 1GB (SF1) and 30GB (SF30), respectively. While evaluated on AMD/Xilinx Alveo U50 FPGA boards, SQL2FPGA using Ver. 2022 AMD/Xilinx FPGA overlay designs also achieve an average speedup of 9.6x at SF1 scale factor.","PeriodicalId":505501,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"24 2","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141687883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hardware Acceleration for High-Volume Operations of CRYSTALS-Kyber and CRYSTALS-Dilithium
Xavier Carril, Charalampos Kardaris, Jordi Ribes-González, O. Farràs, Carles Hernández, Vatistas Kostalabros, Joel Ulises González-Jiménez, Miquel Moretó
ACM Transactions on Reconfigurable Technology and Systems (2024-07-02). DOI: 10.1145/3675172
Many high-demand digital services need to perform several cryptographic operations, such as key exchange or security credentialing, in a short amount of time. The security of some of these cryptographic schemes is threatened by advances in quantum computing, as quantum computers could break their security in the near future. Post-Quantum Cryptography (PQC) is an emerging field that studies cryptographic algorithms that resist such attacks. The National Institute of Standards and Technology (NIST) has selected the CRYSTALS-Kyber Key Encapsulation Mechanism and the CRYSTALS-Dilithium Digital Signature Algorithm as primary PQC standards. In this paper, we present FPGA-based hardware accelerators for high-volume operations of both schemes. We apply High-Level Synthesis (HLS) for hardware optimization, leveraging a batch-processing approach to maximize memory throughput and applying custom HLS logic to specific algorithmic components. Using reconfigurable field-programmable gate arrays (FPGAs), we show that our hardware accelerators achieve speedups between 3× and 9× over software baseline implementations, even over implementations that leverage CPU vector architectures. Furthermore, the methods used in this study can also be extended to the new CRYSTALS-based NIST FIPS drafts, ML-KEM and ML-DSA, with similar acceleration results.
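For illustration, the sketch below shows the batch-processing style in Vitis-HLS-flavored C: many independent requests stream through one pipelined datapath so the memory interface stays busy. Kyber's parameters (q = 3329, n = 256) are the scheme's real constants, but the coefficient-wise operation is a simplified stand-in for its actual arithmetic, and the batch size is our assumption, not a figure from the paper.

```c
#include <stdint.h>

#define KYBER_Q 3329   /* Kyber's prime modulus                         */
#define KYBER_N 256    /* polynomial degree                             */
#define BATCH   64     /* illustrative batch size, not from the paper   */

/* Batched coefficient-wise modular multiply: one deeply pipelined loop
 * body processes BATCH independent polynomial pairs back to back.
 * Inputs are assumed already reduced to [0, q). The real accelerators
 * implement the schemes' full arithmetic (NTT, basemul, etc.); this
 * only illustrates the batching/pipelining structure. */
void batched_pointwise(const int16_t a[BATCH][KYBER_N],
                       const int16_t b[BATCH][KYBER_N],
                       int16_t r[BATCH][KYBER_N])
{
    for (int j = 0; j < BATCH; j++) {
        for (int i = 0; i < KYBER_N; i++) {
#pragma HLS PIPELINE II = 1
            /* widen to 32 bits before reducing mod q */
            int32_t t = (int32_t)a[j][i] * b[j][i];
            r[j][i] = (int16_t)(t % KYBER_Q);
        }
    }
}
```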
{"title":"Hardware Acceleration for High-Volume Operations of CRYSTALS-Kyber and CRYSTALS-Dilithium","authors":"Xavier Carril, Charalampos Kardaris, Jordi Ribes-González, O. Farràs, Carles Hernández, Vatistas Kostalabros, Joel Ulises González-Jiménez, Miquel Moretó","doi":"10.1145/3675172","DOIUrl":"https://doi.org/10.1145/3675172","url":null,"abstract":"Many high-demand digital services need to perform several cryptographic operations, such as key exchange or security credentialing, in a concise amount of time. In turn, the security of some of these cryptographic schemes is threatened by advances in quantum computing, as quantum computer could break their security in the near future. Post-Quantum Cryptography (PQC) is an emerging field that studies cryptographic algorithms that resist such attacks. The National Institute of Standards and Technology (NIST) has selected the CRYSTALS-Kyber Key Encapsulation Mechanism and the CRYSTALSDilithium Digital Signature algorithm as primary PQC standards. In this paper, we present FPGA-based hardware accelerators for high-volume operations of both schemes. We apply High-Level Synthesis (HLS) for hardware optimization, leveraging a batch processing approach to maximize the memory throughput, and applying custom HLS logic to specific algorithmic components. Using reconfigurable field-programmable gate arrays (FPGAs), we show that our hardware accelerators achieve speedups between 3x and 9x over software baseline implementations, even over ones leveraging CPU vector architectures. Furthermore, the methods used in this study can also be extended to the new CRYSTALS-based NIST FIPS drafts, ML-KEM and ML-DSA, with similar acceleration results.","PeriodicalId":505501,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"17 14","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141685685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Scalable Accelerator for Local Score Computation of Structure Learning in Bayesian Networks
Ryota Miyagi, Ryota Yasudo, Kentaro Sano, Hideki Takase
ACM Transactions on Reconfigurable Technology and Systems (2024-07-02). DOI: 10.1145/3674842
A Bayesian network is a powerful tool for representing uncertainty in data, offering transparent and interpretable inference, unlike the black-box mechanisms of neural networks. To fully harness the potential of Bayesian networks, it is essential to learn the graph structure that appropriately represents the interrelations among variables in the data. Score-based structure learning, which involves constructing collections of potentially optimal parent sets for each variable, is computationally intensive, especially when dealing with high-dimensional data over discrete random variables. Our proposed acceleration algorithm extracts high levels of parallelism, offering significant advantages even with reduced reusability of computational results. In addition, it employs an elastic data representation tailored for parallel computation, making it FPGA-friendly and optimizing module occupancy while ensuring uniform handling of diverse problem scenarios. Demonstrated on a Xilinx Alveo U50 FPGA, our implementation significantly outperforms optimal CPU algorithms and is several times faster than GPU implementations on an NVIDIA TITAN RTX. Furthermore, performance modeling of the accelerator indicates that, for sufficiently large problem instances, it is weakly scalable, meaning that it effectively utilizes increased computational resources for parallelization. To our knowledge, this is the first study to propose a comprehensive methodology for accelerating score-based structure learning, blending algorithmic and architectural considerations.
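For illustration, a common local score in score-based structure learning is the log-likelihood of a variable given one candidate parent set, computed from contingency counts. The C sketch below captures only that scalar computation; it is our own example. Real scores such as BDeu add a prior term, and the accelerator's contribution is evaluating many candidate parent sets in parallel.

```c
#include <math.h>
#include <stdio.h>

/* Log-likelihood local score of variable X given one candidate parent
 * set: counts[j][k] is the number of records with parent configuration
 * j and X in state k. Score = sum_jk N_jk * log(N_jk / N_j). */
double local_loglik(int n_cfg, int n_states, const int counts[][8])
{
    double score = 0.0;
    for (int j = 0; j < n_cfg; j++) {
        int nj = 0;
        for (int k = 0; k < n_states; k++)
            nj += counts[j][k];
        for (int k = 0; k < n_states; k++)
            if (counts[j][k] > 0)
                score += counts[j][k] * log((double)counts[j][k] / nj);
    }
    return score;
}

int main(void)
{
    /* Two parent configurations, binary X: X correlates with the parent. */
    const int counts[2][8] = { { 30, 10 }, { 5, 55 } };
    printf("local log-likelihood: %f\n", local_loglik(2, 2, counts));
    return 0;
}
```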
{"title":"A Scalable Accelerator for Local Score Computation of Structure Learning in Bayesian Networks","authors":"Ryota Miyagi, Ryota Yasudo, Kentaro Sano, Hideki Takase","doi":"10.1145/3674842","DOIUrl":"https://doi.org/10.1145/3674842","url":null,"abstract":"A Bayesian network is a powerful tool for representing uncertainty in data, offering transparent and interpretable inference, unlike neural networks’ black-box mechanisms. To fully harness the potential of Bayesian networks, it is essential to learn the graph structure that appropriately represents variable interrelations within data. Score-based structure learning, which involves constructing collections of potentially optimal parent sets for each variable, is computationally intensive, especially when dealing with high-dimensional data in discrete random variables. Our proposed novel acceleration algorithm extracts high levels of parallelism, offering significant advantages even with reduced reusability of computational results. In addition, it employs an elastic data representation tailored for parallel computation, making it FPGA-friendly and optimizing module occupancy while ensuring uniform handling of diverse problem scenarios. Demonstrated on a Xilinx Alveo U50 FPGA, our implementation significantly outperforms optimal CPU algorithms and is several times faster than GPU implementations on an NVIDIA TITAN RTX. Furthermore, the results of performance modeling for the accelerator indicate that, for sufficiently large problem instances, it is weakly scalable, meaning that it effectively utilizes increased computational resources for parallelization. To our knowledge, this is the first study to propose a comprehensive methodology for accelerating score-based structure learning, blending algorithmic and architectural considerations.","PeriodicalId":505501,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"66 s94","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141688409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Computation of the Ninth Dedekind Number using FPGA Supercomputing
Lennart Van Hirtum, P. D. Causmaecker, Jens Goemaere, Tobias Kenter, Heinrich Riebler, Michael Lass, Christian Plessl
ACM Transactions on Reconfigurable Technology and Systems (2024-07-02). DOI: 10.1145/3674147
This manuscript makes the claim of having computed the 9th Dedekind number, D(9). This was done by accelerating the core operation of the process with an efficient FPGA design that outperforms an optimized 64-core CPU reference by 95×. The FPGA execution was parallelized on the Noctua 2 supercomputer at Paderborn University. The resulting value for D(9) is 286386577668298411128469151667598498812366. This value can be verified in two steps: we have made the data file containing the 490M partial results available, each of which can be verified separately on a CPU, and the whole file sums to our proposed value. The paper explains the mathematical approach in the first part before taking a deep dive into the FPGA accelerator implementation, followed by a performance analysis. The FPGA implementation was done in RTL using a dual-clock architecture, and we show how we achieved an FMax of 450 MHz on the targeted Stratix 10 GX 2800 FPGAs. The total compute time used was 47,000 FPGA hours.
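For context, the quantity computed is standard background rather than material from the abstract: D(n) is the number of monotone Boolean functions of n variables, equivalently the number of antichains in the subset lattice of an n-element set.

```latex
% Definition of the Dedekind numbers and their first values.
\[
  D(n) \;=\; \bigl|\{\, f:\{0,1\}^n \to \{0,1\} \;:\;
      x \le y \Rightarrow f(x) \le f(y) \,\}\bigr|
\]
\[
  D(0)=2,\quad D(1)=3,\quad D(2)=6,\quad D(3)=20,\quad D(4)=168, \;\ldots
\]
```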
{"title":"A Computation of the Ninth Dedekind Number using FPGA Supercomputing","authors":"Lennart Van Hirtum, P. D. Causmaecker, Jens Goemaere, Tobias Kenter, Heinrich Riebler, Michael Lass, Christian Plessl","doi":"10.1145/3674147","DOIUrl":"https://doi.org/10.1145/3674147","url":null,"abstract":"This manuscript makes the claim of having computed the (9^{th}) Dedekind number, D(9). This was done by accelerating the core operation of the process with an efficient FPGA design that outperforms an optimized 64-core CPU reference by 95 (times) . The FPGA execution was parallelized on the Noctua 2 supercomputer at Paderborn University. The resulting value for D(9) is (286386577668298411128469151667598498812366) . This value can be verified in two steps. We have made the data file containing the 490M results available, each of which can be verified separately on CPU, and the whole file sums to our proposed value. The paper explains the mathematical approach in the first part, before putting the focus on a deep dive into the FPGA accelerator implementation followed by a performance analysis. The FPGA implementation was done in RTL using a dual-clock architecture and shows how we achieved an impressive FMax of 450MHz on the targeted Stratix 10 GX 2800 FPGAs. The total compute time used was 47’000 FPGA Hours.","PeriodicalId":505501,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"2 4","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141686751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}