Elliptic Curve Cryptography (ECC), one of the most widely used asymmetric cryptographic algorithms, has been deployed in Transport Layer Security (TLS) protocol, blockchain, secure multiparty computation, etc. As one of the most secure ECC curves, Curve25519 is employed by some secure protocols, such as TLS 1.3 and Diffie-Hellman Private Set Intersection (DH-PSI) protocol. High performance implementation of ECC is required, especially for the DH-PSI protocol used in privacy-preserving platform.
Point multiplication, the chief cryptographic primitive in ECC, is computationally expensive. To improve the performance of DH-PSI protocol, we propose Topgun, a novel and high-performance hardware architecture for point multiplication over Curve25519. The proposed architecture features a pipelined Finite-field Arithmetic Unit and a simple and highly efficient instruction set architecture. Compared to the best existing work on Xilinx Zynq 7000 series FPGA, our implementation with one Processing Element can achieve 3.14 × speedup on the same device. To the best of our knowledge, our implementation appears to be the fastest among the state-of-the-art works. We also have implemented our architecture consisting of 4 Compute Groups, each with 16 PEs, on an Intel Agilex AGF027 FPGA. The measured performance of 4.48 Mops/s is achieved at the cost of 86 Watts power, which is the record-setting performance for point multiplication over Curve25519 on FPGAs.
{"title":"Topgun: An ECC Accelerator for Private Set Intersection","authors":"Guiming Wu, Qianwen He, Jiali Jiang, Zhenxiang Zhang, Yuan Zhao, Yinchao Zou, Jie Zhang, Changzheng Wei, Ying Yan, Hui Zhang","doi":"https://dl.acm.org/doi/10.1145/3603114","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3603114","url":null,"abstract":"<p>Elliptic Curve Cryptography (ECC), one of the most widely used asymmetric cryptographic algorithms, has been deployed in Transport Layer Security (TLS) protocol, blockchain, secure multiparty computation, etc. As one of the most secure ECC curves, Curve25519 is employed by some secure protocols, such as TLS 1.3 and Diffie-Hellman Private Set Intersection (DH-PSI) protocol. High performance implementation of ECC is required, especially for the DH-PSI protocol used in privacy-preserving platform. </p><p>Point multiplication, the chief cryptographic primitive in ECC, is computationally expensive. To improve the performance of DH-PSI protocol, we propose Topgun, a novel and high-performance hardware architecture for point multiplication over Curve25519. The proposed architecture features a pipelined Finite-field Arithmetic Unit and a simple and highly efficient instruction set architecture. Compared to the best existing work on Xilinx Zynq 7000 series FPGA, our implementation with one Processing Element can achieve 3.14 × speedup on the same device. To the best of our knowledge, our implementation appears to be the fastest among the state-of-the-art works. We also have implemented our architecture consisting of 4 Compute Groups, each with 16 PEs, on an Intel Agilex AGF027 FPGA. The measured performance of 4.48 Mops/s is achieved at the cost of 86 Watts power, which is the record-setting performance for point multiplication over Curve25519 on FPGAs.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":null,"pages":null},"PeriodicalIF":2.3,"publicationDate":"2023-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138504979","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Elliptic Curve Cryptography (ECC), one of the most widely used asymmetric cryptographic algorithms, has been deployed in Transport Layer Security (TLS) protocol, blockchain, secure multiparty computation, etc. As one of the most secure ECC curves, Curve25519 is employed by some secure protocols, such as TLS 1.3 and Diffie-Hellman Private Set Intersection (DH-PSI) protocol. High performance implementation of ECC is required, especially for the DH-PSI protocol used in privacy-preserving platform. Point multiplication, the chief cryptographic primitive in ECC, is computationally expensive. To improve the performance of DH-PSI protocol, we propose Topgun, a novel and high-performance hardware architecture for point multiplication over Curve25519. The proposed architecture features a pipelined Finite-field Arithmetic Unit and a simple and highly efficient instruction set architecture. Compared to the best existing work on Xilinx Zynq 7000 series FPGA, our implementation with one Processing Element can achieve 3.14 × speedup on the same device. To the best of our knowledge, our implementation appears to be the fastest among the state-of-the-art works. We also have implemented our architecture consisting of 4 Compute Groups, each with 16 PEs, on an Intel Agilex AGF027 FPGA. The measured performance of 4.48 Mops/s is achieved at the cost of 86 Watts power, which is the record-setting performance for point multiplication over Curve25519 on FPGAs.
{"title":"Topgun: An ECC Accelerator for Private Set Intersection","authors":"Guiming Wu, Qianwen He, Jiali Jiang, Zhenxiang Zhang, Yuan Zhao, Yinchao Zou, Jie Zhang, Changzheng Wei, Ying Yan, Hui Zhang","doi":"10.1145/3603114","DOIUrl":"https://doi.org/10.1145/3603114","url":null,"abstract":"Elliptic Curve Cryptography (ECC), one of the most widely used asymmetric cryptographic algorithms, has been deployed in Transport Layer Security (TLS) protocol, blockchain, secure multiparty computation, etc. As one of the most secure ECC curves, Curve25519 is employed by some secure protocols, such as TLS 1.3 and Diffie-Hellman Private Set Intersection (DH-PSI) protocol. High performance implementation of ECC is required, especially for the DH-PSI protocol used in privacy-preserving platform. Point multiplication, the chief cryptographic primitive in ECC, is computationally expensive. To improve the performance of DH-PSI protocol, we propose Topgun, a novel and high-performance hardware architecture for point multiplication over Curve25519. The proposed architecture features a pipelined Finite-field Arithmetic Unit and a simple and highly efficient instruction set architecture. Compared to the best existing work on Xilinx Zynq 7000 series FPGA, our implementation with one Processing Element can achieve 3.14 × speedup on the same device. To the best of our knowledge, our implementation appears to be the fastest among the state-of-the-art works. We also have implemented our architecture consisting of 4 Compute Groups, each with 16 PEs, on an Intel Agilex AGF027 FPGA. The measured performance of 4.48 Mops/s is achieved at the cost of 86 Watts power, which is the record-setting performance for point multiplication over Curve25519 on FPGAs.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":null,"pages":null},"PeriodicalIF":2.3,"publicationDate":"2023-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44420898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-06-22DOI: https://dl.acm.org/doi/10.1145/3593586
Rasha Karakchi, Jason D. Bakos
Deterministic and Non-deterministic Finite Automata (DFA and NFA) comprise the core of many big data applications. Recent efforts to develop Domain-Specific Architectures (DSAs) for DFA/NFA have taken divergent approaches, but achieving consistent throughput for arbitrarily-large pattern sets, state activation rates, and pattern match rates remains a challenge. In this article, we present NAPOLY (Non-Deterministic Automata Processor OverLaY), an FPGA overlay and associated compiler. A common limitation of prior efforts is a limit on NFA size for achieving the advertised throughput. NAPOLY is optimized for fast re-programming to permit practical time-division multiplexing of the hardware and permit high asymptotic throughput for NFAs of unlimited size, unlimited state activation rate, and high pattern reporting rate. NAPOLY also allows for offline generation of configurations having tradeoffs between state capacity and transition capacity. In this article, we (1) evaluate NAPOLY using benchmarks packaged in the ANMLZoo benchmark suite, (2) evaluate the use of an SAT solver for allocating physical resources, and (3) compare NAPOLY’s performance against existing solutions. NAPOLY performs most favorably on larger benchmarks, benchmarks with higher state activation frequency, and benchmarks with higher reporting frequency. NAPOLY outperforms the fastest of the CPU and GPU implementations in 10 out of 12 benchmarks.
{"title":"NAPOLY: A Non-deterministic Automata Processor OverLaY","authors":"Rasha Karakchi, Jason D. Bakos","doi":"https://dl.acm.org/doi/10.1145/3593586","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3593586","url":null,"abstract":"<p>Deterministic and Non-deterministic Finite Automata (DFA and NFA) comprise the core of many big data applications. Recent efforts to develop Domain-Specific Architectures (DSAs) for DFA/NFA have taken divergent approaches, but achieving consistent throughput for arbitrarily-large pattern sets, state activation rates, and pattern match rates remains a challenge. In this article, we present NAPOLY (Non-Deterministic Automata Processor OverLaY), an FPGA overlay and associated compiler. A common limitation of prior efforts is a limit on NFA size for achieving the advertised throughput. NAPOLY is optimized for fast re-programming to permit practical time-division multiplexing of the hardware and permit high asymptotic throughput for NFAs of unlimited size, unlimited state activation rate, and high pattern reporting rate. NAPOLY also allows for offline generation of configurations having tradeoffs between state capacity and transition capacity. In this article, we (1) evaluate NAPOLY using benchmarks packaged in the ANMLZoo benchmark suite, (2) evaluate the use of an SAT solver for allocating physical resources, and (3) compare NAPOLY’s performance against existing solutions. NAPOLY performs most favorably on larger benchmarks, benchmarks with higher state activation frequency, and benchmarks with higher reporting frequency. NAPOLY outperforms the fastest of the CPU and GPU implementations in 10 out of 12 benchmarks.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":null,"pages":null},"PeriodicalIF":2.3,"publicationDate":"2023-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138505002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-06-22DOI: https://dl.acm.org/doi/10.1145/3567429
Carol Jingyi Li, Xiangwei Li, Binglei Lou, Craig T. Jin, David Boland, Philip H. W. Leong
The spectral correlation density (SCD) is an important tool in cyclostationary signal detection and classification. Even using efficient techniques based on the fast Fourier transform (FFT), real-time implementations are challenging because of the high computational complexity. A key dimension for computational optimization lies in minimizing the wordlength employed. In this article, we analyze the relationship between wordlength and signal-to-quantization noise in fixed-point implementations of the SCD function. A canonical SCD estimation algorithm, the FFT accumulation method (FAM) using fixed-point arithmetic, is studied. We derive closed-form expressions for SQNR and compare them at wordlengths ranging from 14 to 26 bits. The differences between the calculated SQNR and bit-exact simulations are less than 1 dB. Furthermore, an HLS-based FPGA design is implemented on a Xilinx Zynq UltraScale+ XCZU28DR-2FFVG1517E RFSoC. Using less than 25% of the logic fabric on the device, it consumes 7.7 W total on-chip power and has a power efficiency of 12.4 GOPS/W, which is an order of magnitude improvement over an Nvidia Tesla K40 graphics processing unit (GPU) implementation. In terms of throughput, it achieves 50 MS/sec, which is a speedup of 1.6 over a recent optimized FPGA implementation.
{"title":"Fixed-point FPGA Implementation of the FFT Accumulation Method for Real-time Cyclostationary Analysis","authors":"Carol Jingyi Li, Xiangwei Li, Binglei Lou, Craig T. Jin, David Boland, Philip H. W. Leong","doi":"https://dl.acm.org/doi/10.1145/3567429","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3567429","url":null,"abstract":"<p>The spectral correlation density (SCD) is an important tool in cyclostationary signal detection and classification. Even using efficient techniques based on the fast Fourier transform (FFT), real-time implementations are challenging because of the high computational complexity. A key dimension for computational optimization lies in minimizing the wordlength employed. In this article, we analyze the relationship between wordlength and signal-to-quantization noise in fixed-point implementations of the SCD function. A canonical SCD estimation algorithm, the FFT accumulation method (FAM) using fixed-point arithmetic, is studied. We derive closed-form expressions for SQNR and compare them at wordlengths ranging from 14 to 26 bits. The differences between the calculated SQNR and bit-exact simulations are less than 1 dB. Furthermore, an HLS-based FPGA design is implemented on a Xilinx Zynq UltraScale+ XCZU28DR-2FFVG1517E RFSoC. Using less than 25% of the logic fabric on the device, it consumes 7.7 W total on-chip power and has a power efficiency of 12.4 GOPS/W, which is an order of magnitude improvement over an Nvidia Tesla K40 graphics processing unit (GPU) implementation. In terms of throughput, it achieves 50 MS/sec, which is a speedup of 1.6 over a recent optimized FPGA implementation.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":null,"pages":null},"PeriodicalIF":2.3,"publicationDate":"2023-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138505000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-06-21DOI: https://dl.acm.org/doi/10.1145/3568992
Binglei Lou, David Boland, Philip Leong
Machine learning ensembles combine multiple base models to produce a more accurate output. They can be applied to a range of machine learning problems, including anomaly detection. In this article, we investigate how to maximize the composability and scalability of an FPGA-based streaming ensemble anomaly detector (fSEAD). To achieve this, we propose a flexible computing architecture consisting of multiple partially reconfigurable regions, pblocks, which each implement anomaly detectors. Our proof-of-concept design supports three state-of-the-art anomaly detection algorithms: Loda, RS-Hash, and xStream. Each algorithm is scalable, meaning multiple instances can be placed within a pblock to improve performance. Moreover, fSEAD is implemented using High-level synthesis (HLS), meaning further custom anomaly detectors can be supported. Pblocks are interconnected via an AXI-switch, enabling them to be composed in an arbitrary fashion before combining and merging results at runtime to create an ensemble that maximizes the use of FPGA resources and accuracy. Through utilizing reconfigurable Dynamic Function eXchange (DFX), the detector can be modified at runtime to adapt to changing environmental conditions. We compare fSEAD to an equivalent central processing unit (CPU) implementation using four standard datasets, with speedups ranging from 3× to 8×.
{"title":"fSEAD: A Composable FPGA-based Streaming Ensemble Anomaly Detection Library","authors":"Binglei Lou, David Boland, Philip Leong","doi":"https://dl.acm.org/doi/10.1145/3568992","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3568992","url":null,"abstract":"<p>Machine learning ensembles combine multiple base models to produce a more accurate output. They can be applied to a range of machine learning problems, including anomaly detection. In this article, we investigate how to maximize the composability and scalability of an FPGA-based streaming ensemble anomaly detector (fSEAD). To achieve this, we propose a flexible computing architecture consisting of multiple partially reconfigurable regions, pblocks, which each implement anomaly detectors. Our proof-of-concept design supports three state-of-the-art anomaly detection algorithms: Loda, RS-Hash, and xStream. Each algorithm is scalable, meaning multiple instances can be placed within a pblock to improve performance. Moreover, fSEAD is implemented using High-level synthesis (HLS), meaning further custom anomaly detectors can be supported. Pblocks are interconnected via an AXI-switch, enabling them to be composed in an arbitrary fashion before combining and merging results at runtime to create an ensemble that maximizes the use of FPGA resources and accuracy. Through utilizing reconfigurable Dynamic Function eXchange (DFX), the detector can be modified at runtime to adapt to changing environmental conditions. We compare fSEAD to an equivalent central processing unit (CPU) implementation using four standard datasets, with speedups ranging from 3× to 8×.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":null,"pages":null},"PeriodicalIF":2.3,"publicationDate":"2023-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138504975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-06-21DOI: https://dl.acm.org/doi/10.1145/3588318
Pedro Machado, João Filipe Ferreira, Andreas Oikonomou, T. M. McGinnity
Vertebrate retinas are highly-efficient in processing trivial visual tasks such as detecting moving objects, which still represent complex challenges for modern computers. In vertebrates, the detection of object motion is performed by specialised retinal cells named Object Motion Sensitive Ganglion Cells (OMS-GC). OMS-GC process continuous visual signals and generate spike patterns that are post-processed by the Visual Cortex. Our previous Hybrid Sensitive Motion Detector (HSMD) algorithm was the first hybrid algorithm to enhance Background subtraction (BS) algorithms with a customised 3-layer Spiking Neural Network (SNN) that generates OMS-GC spiking-like responses. In this work, we present a Neuromorphic Hybrid Sensitive Motion Detector (NeuroHSMD) algorithm that accelerates our HSMD algorithm using Field-Programmable Gate Arrays (FPGAs). The NeuroHSMD was compared against the HSMD algorithm, using the same 2012 Change Detection (CDnet2012) and 2014 Change Detection (CDnet2014) benchmark datasets. When tested against the CDnet2012 and CDnet2014 datasets, NeuroHSMD performs object motion detection at 720 × 480 at 28.06 Frames Per Second (fps) and 720 × 480 at 28.71 fps, respectively, with no degradation of quality. Moreover, the NeuroHSMD proposed in this article was completely implemented in Open Computer Language (OpenCL) and therefore is easily replicated in other devices such as Graphical Processing Units (GPUs) and clusters of Central Processing Units (CPUs).
{"title":"NeuroHSMD: Neuromorphic Hybrid Spiking Motion Detector","authors":"Pedro Machado, João Filipe Ferreira, Andreas Oikonomou, T. M. McGinnity","doi":"https://dl.acm.org/doi/10.1145/3588318","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3588318","url":null,"abstract":"<p>Vertebrate retinas are highly-efficient in processing trivial visual tasks such as detecting moving objects, which still represent complex challenges for modern computers. In vertebrates, the detection of object motion is performed by specialised retinal cells named Object Motion Sensitive Ganglion Cells (OMS-GC). OMS-GC process continuous visual signals and generate spike patterns that are post-processed by the Visual Cortex. Our previous Hybrid Sensitive Motion Detector (HSMD) algorithm was the first hybrid algorithm to enhance Background subtraction (BS) algorithms with a customised 3-layer Spiking Neural Network (SNN) that generates OMS-GC spiking-like responses. In this work, we present a Neuromorphic Hybrid Sensitive Motion Detector (NeuroHSMD) algorithm that accelerates our HSMD algorithm using Field-Programmable Gate Arrays (FPGAs). The NeuroHSMD was compared against the HSMD algorithm, using the same 2012 Change Detection (CDnet2012) and 2014 Change Detection (CDnet2014) benchmark datasets. When tested against the CDnet2012 and CDnet2014 datasets, NeuroHSMD performs object motion detection at 720 × 480 at 28.06 Frames Per Second (fps) and 720 × 480 at 28.71 fps, respectively, with no degradation of quality. Moreover, the NeuroHSMD proposed in this article was completely implemented in Open Computer Language (OpenCL) and therefore is easily replicated in other devices such as Graphical Processing Units (GPUs) and clusters of Central Processing Units (CPUs).</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":null,"pages":null},"PeriodicalIF":2.3,"publicationDate":"2023-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138504999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-06-21DOI: https://dl.acm.org/doi/10.1145/3572959
Hyegang Jun, Hanchen Ye, Hyunmin Jeong, Deming Chen
High-Level Synthesis (HLS) has enabled users to rapidly develop designs targeted for FPGAs from the behavioral description of the design. However, to synthesize an optimal design capable of taking better advantage of the target FPGA, a considerable amount of effort is needed to transform the initial behavioral description into a form that can capture the desired level of parallelism. Thus, a design space exploration (DSE) engine capable of optimizing large complex designs is needed to achieve this goal. We present a new DSE engine capable of considering code transformation, compiler directives (pragmas), and the compatibility of these optimizations. To accomplish this, we initially express the structure of the input code as a graph to guide the exploration process. To appropriately transform the code, we take advantage of ScaleHLS based on the multi-level compiler infrastructure (MLIR). Finally, we identify problems that limit the scalability of existing DSEs, which we name the “design space merging problem.” We address this issue by employing a Random Forest classifier that can successfully decrease the number of invalid design points without invoking the HLS compiler as a validation tool. We evaluated our DSE engine against the ScaleHLS DSE, outperforming it by a maximum of 59×. We additionally demonstrate the scalability of our design by applying our DSE to large-scale HLS designs, achieving a maximum speedup of 12× for the benchmarks in the MachSuite and Rodinia set.
{"title":"AutoScaleDSE: A Scalable Design Space Exploration Engine for High-Level Synthesis","authors":"Hyegang Jun, Hanchen Ye, Hyunmin Jeong, Deming Chen","doi":"https://dl.acm.org/doi/10.1145/3572959","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3572959","url":null,"abstract":"<p>High-Level Synthesis (HLS) has enabled users to rapidly develop designs targeted for FPGAs from the behavioral description of the design. However, to synthesize an optimal design capable of taking better advantage of the target FPGA, a considerable amount of effort is needed to transform the initial behavioral description into a form that can capture the desired level of parallelism. Thus, a design space exploration (DSE) engine capable of optimizing large complex designs is needed to achieve this goal. We present a new DSE engine capable of considering code transformation, compiler directives (pragmas), and the compatibility of these optimizations. To accomplish this, we initially express the structure of the input code as a graph to guide the exploration process. To appropriately transform the code, we take advantage of ScaleHLS based on the multi-level compiler infrastructure (MLIR). Finally, we identify problems that limit the scalability of existing DSEs, which we name the “design space merging problem.” We address this issue by employing a Random Forest classifier that can successfully decrease the number of invalid design points without invoking the HLS compiler as a validation tool. We evaluated our DSE engine against the ScaleHLS DSE, outperforming it by a maximum of 59×. We additionally demonstrate the scalability of our design by applying our DSE to large-scale HLS designs, achieving a maximum speedup of 12× for the benchmarks in the MachSuite and Rodinia set.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":null,"pages":null},"PeriodicalIF":2.3,"publicationDate":"2023-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138504978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-06-21DOI: https://dl.acm.org/doi/10.1145/3596513
Miriam Leeser
Authors of papers that were accepted to ACM TRETS via the FPT 2022 journal track had the option of participating in Artifact Evaluation (AE). Four papers from this track volunteered to participate in the AE process. All of these papers have been awarded badges from ACM as described below.
{"title":"Artifact Evaluation for ACM TRETS Papers Submitted from the FPT Journal Track","authors":"Miriam Leeser","doi":"https://dl.acm.org/doi/10.1145/3596513","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3596513","url":null,"abstract":"<p>Authors of papers that were accepted to ACM TRETS via the FPT 2022 journal track had the option of participating in Artifact Evaluation (AE). Four papers from this track volunteered to participate in the AE process. All of these papers have been awarded badges from ACM as described below.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":null,"pages":null},"PeriodicalIF":2.3,"publicationDate":"2023-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138504976","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-06-21DOI: https://dl.acm.org/doi/10.1145/3585521
Alex R. Bucknall, Suhaib A. Fahmy
Partial reconfiguration (PR) is a key enabler to the design and development of adaptive systems on modern Field Programmable Gate Array (FPGA) Systems-on-Chip (SoCs), allowing hardware to be adapted dynamically at runtime. Vendor-supported PR infrastructure is performance-limited and blocking, drivers entail complex memory management, and software/hardware design requires bespoke knowledge of the underlying hardware. This article presents ZyPR: a complete end-to-end framework that provides high-performance reconfiguration of hardware from within a software abstraction in the Linux userspace, automating the process of building PR applications with support for the Xilinx Zynq and Zynq UltraScale+ architectures, aimed at enabling non-expert application designers to leverage PR for edge applications. We compare ZyPR against traditional vendor tooling for PR management as well as recent open source tools that support PR under Linux. The framework provides a high-performance runtime along with low overhead for its provided abstractions. We introduce improvements to our previous work, increasing the provisioning throughput for PR bitstreams on the Zynq Ultrascale+ by 2× and 5.4× compared to Xilinx’s FPGA Manager.
{"title":"ZyPR: End-to-end Build Tool and Runtime Manager for Partial Reconfiguration of FPGA SoCs at the Edge","authors":"Alex R. Bucknall, Suhaib A. Fahmy","doi":"https://dl.acm.org/doi/10.1145/3585521","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3585521","url":null,"abstract":"<p>Partial reconfiguration (PR) is a key enabler to the design and development of adaptive systems on modern Field Programmable Gate Array (FPGA) Systems-on-Chip (SoCs), allowing hardware to be adapted dynamically at runtime. Vendor-supported PR infrastructure is performance-limited and blocking, drivers entail complex memory management, and software/hardware design requires bespoke knowledge of the underlying hardware. This article presents ZyPR: a complete end-to-end framework that provides high-performance reconfiguration of hardware from within a software abstraction in the Linux userspace, automating the process of building PR applications with support for the Xilinx Zynq and Zynq UltraScale+ architectures, aimed at enabling non-expert application designers to leverage PR for edge applications. We compare ZyPR against traditional vendor tooling for PR management as well as recent open source tools that support PR under Linux. The framework provides a high-performance runtime along with low overhead for its provided abstractions. We introduce improvements to our previous work, increasing the provisioning throughput for PR bitstreams on the Zynq Ultrascale+ by 2× and 5.4× compared to Xilinx’s FPGA Manager.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":null,"pages":null},"PeriodicalIF":2.3,"publicationDate":"2023-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138505040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Post-quantum cryptography (PQC) has recently drawn substantial attention from various communities owing to the proven vulnerability of existing public-key cryptosystems against the attacks launched from well-established quantum computers. The Ring-Binary-Learning-with-Errors (RBLWE), a variant of Ring-LWE, has been proposed to build PQC for lightweight applications. As more Field-Programmable Gate Array (FPGA) devices are being deployed in lightweight applications like Internet-of-Things (IoT) devices, it would be interesting if the RBLWE-based PQC can be implemented on the FPGA with ultra-low complexity and flexible processing. However, thus far, limited information is available for such implementations. In this article, we propose novel RBLWE-based PQC accelerators on the FPGA with ultra-low implementation complexity and flexible timing. We first present the process of deriving the key operation of the RBLWE-based scheme into the proposed algorithmic operation. The corresponding hardware accelerator is then efficiently mapped from the proposed algorithm with the help of algorithm-to-architecture implementation techniques and extended to obtain higher-throughput designs. The final complexity analysis and implementation results (on a variety of FPGAs) show that the proposed accelerators have significantly smaller area-time complexities than the state-of-the-art designs. Overall, the proposed accelerators feature low implementation complexity and flexible processing, making them desirable for emerging FPGA-based lightweight applications.
{"title":"FPGA Implementation of Compact Hardware Accelerators for Ring-Binary-LWE-based Post-quantum Cryptography","authors":"Pengzhou He, Tianyou Bao, Jiafeng Xie, Moeness Amin","doi":"https://dl.acm.org/doi/10.1145/3569457","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3569457","url":null,"abstract":"<p>Post-quantum cryptography (PQC) has recently drawn substantial attention from various communities owing to the proven vulnerability of existing public-key cryptosystems against the attacks launched from well-established quantum computers. The Ring-Binary-Learning-with-Errors (RBLWE), a variant of Ring-LWE, has been proposed to build PQC for lightweight applications. As more Field-Programmable Gate Array (FPGA) devices are being deployed in lightweight applications like Internet-of-Things (IoT) devices, it would be interesting if the RBLWE-based PQC can be implemented on the FPGA with ultra-low complexity and flexible processing. However, thus far, limited information is available for such implementations. In this article, we propose novel RBLWE-based PQC accelerators on the FPGA with ultra-low implementation complexity and flexible timing. We first present the process of deriving the key operation of the RBLWE-based scheme into the proposed algorithmic operation. The corresponding hardware accelerator is then efficiently mapped from the proposed algorithm with the help of algorithm-to-architecture implementation techniques and extended to obtain higher-throughput designs. The final complexity analysis and implementation results (on a variety of FPGAs) show that the proposed accelerators have significantly smaller area-time complexities than the state-of-the-art designs. Overall, the proposed accelerators feature low implementation complexity and flexible processing, making them desirable for emerging FPGA-based lightweight applications.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":null,"pages":null},"PeriodicalIF":2.3,"publicationDate":"2023-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138543681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}