ACM Transactions on Reconfigurable Technology and Systems: Latest Articles, Page 7

fSEAD: A Composable FPGA-based Streaming Ensemble Anomaly Detection Library

IF 2.3, Zone 4 (Computer Science), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-06-21 DOI: https://dl.acm.org/doi/10.1145/3568992

Binglei Lou, David Boland, Philip Leong

Machine learning ensembles combine multiple base models to produce a more accurate output. They can be applied to a range of machine learning problems, including anomaly detection. In this article, we investigate how to maximize the composability and scalability of an FPGA-based streaming ensemble anomaly detector (fSEAD). To achieve this, we propose a flexible computing architecture consisting of multiple partially reconfigurable regions, pblocks, which each implement anomaly detectors. Our proof-of-concept design supports three state-of-the-art anomaly detection algorithms: Loda, RS-Hash, and xStream. Each algorithm is scalable, meaning multiple instances can be placed within a pblock to improve performance. Moreover, fSEAD is implemented using High-level synthesis (HLS), meaning further custom anomaly detectors can be supported. Pblocks are interconnected via an AXI-switch, enabling them to be composed in an arbitrary fashion before combining and merging results at runtime to create an ensemble that maximizes the use of FPGA resources and accuracy. Through utilizing reconfigurable Dynamic Function eXchange (DFX), the detector can be modified at runtime to adapt to changing environmental conditions. We compare fSEAD to an equivalent central processing unit (CPU) implementation using four standard datasets, with speedups ranging from 3× to 8×.
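
As a rough illustration of the ensemble idea described in this abstract, the sketch below averages normalized anomaly scores from several detectors. The raw scores are hypothetical stand-ins rather than outputs of Loda, RS-Hash, or xStream, and the min-max normalization and averaging rule are assumptions made for illustration, not the combiner used in fSEAD.

```python
from statistics import mean

def normalize(scores):
    """Rescale raw scores to [0, 1] so different detectors are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def ensemble_scores(per_detector_scores):
    """Average the normalized scores across detectors, one result per sample."""
    normalized = [normalize(s) for s in per_detector_scores]
    return [mean(column) for column in zip(*normalized)]

# Hypothetical raw scores from three toy detectors over four samples.
raw = [
    [0.1, 0.2, 0.1, 0.9],
    [1.0, 1.5, 1.0, 4.0],
    [10.0, 10.0, 12.0, 30.0],
]
combined = ensemble_scores(raw)
# Index of the sample that the ensemble considers most anomalous.
most_anomalous = max(range(len(combined)), key=combined.__getitem__)
```

Averaging is only one possible combination rule; taking the maximum or a weighted sum would fit the same structure.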

Citations: 0

NeuroHSMD: Neuromorphic Hybrid Spiking Motion Detector

IF 2.3, Zone 4 (Computer Science), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-06-21 DOI: https://dl.acm.org/doi/10.1145/3588318

Pedro Machado, João Filipe Ferreira, Andreas Oikonomou, T. M. McGinnity

Vertebrate retinas are highly efficient at processing seemingly trivial visual tasks, such as detecting moving objects, that still represent complex challenges for modern computers. In vertebrates, the detection of object motion is performed by specialised retinal cells named Object Motion Sensitive Ganglion Cells (OMS-GC). OMS-GC process continuous visual signals and generate spike patterns that are post-processed by the Visual Cortex. Our previous Hybrid Sensitive Motion Detector (HSMD) algorithm was the first hybrid algorithm to enhance Background subtraction (BS) algorithms with a customised 3-layer Spiking Neural Network (SNN) that generates OMS-GC spiking-like responses. In this work, we present a Neuromorphic Hybrid Sensitive Motion Detector (NeuroHSMD) algorithm that accelerates our HSMD algorithm using Field-Programmable Gate Arrays (FPGAs). The NeuroHSMD was compared against the HSMD algorithm, using the same 2012 Change Detection (CDnet2012) and 2014 Change Detection (CDnet2014) benchmark datasets. When tested against the CDnet2012 and CDnet2014 datasets, NeuroHSMD performs object motion detection at 720 × 480 at 28.06 frames per second (fps) and 720 × 480 at 28.71 fps, respectively, with no degradation of quality. Moreover, the NeuroHSMD proposed in this article was implemented entirely in Open Computing Language (OpenCL) and can therefore be easily replicated on other devices such as Graphics Processing Units (GPUs) and clusters of Central Processing Units (CPUs).

Citations: 0

AutoScaleDSE: A Scalable Design Space Exploration Engine for High-Level Synthesis

IF 2.3, Zone 4 (Computer Science), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-06-21 DOI: https://dl.acm.org/doi/10.1145/3572959

Hyegang Jun, Hanchen Ye, Hyunmin Jeong, Deming Chen

High-Level Synthesis (HLS) has enabled users to rapidly develop designs targeted for FPGAs from the behavioral description of the design. However, to synthesize an optimal design capable of taking better advantage of the target FPGA, a considerable amount of effort is needed to transform the initial behavioral description into a form that can capture the desired level of parallelism. Thus, a design space exploration (DSE) engine capable of optimizing large complex designs is needed to achieve this goal. We present a new DSE engine capable of considering code transformation, compiler directives (pragmas), and the compatibility of these optimizations. To accomplish this, we initially express the structure of the input code as a graph to guide the exploration process. To appropriately transform the code, we take advantage of ScaleHLS based on the multi-level compiler infrastructure (MLIR). Finally, we identify problems that limit the scalability of existing DSEs, which we name the “design space merging problem.” We address this issue by employing a Random Forest classifier that can successfully decrease the number of invalid design points without invoking the HLS compiler as a validation tool. We evaluated our DSE engine against the ScaleHLS DSE, outperforming it by a maximum of 59×. We additionally demonstrate the scalability of our design by applying our DSE to large-scale HLS designs, achieving a maximum speedup of 12× for the benchmarks in the MachSuite and Rodinia set.

Citations: 0

Artifact Evaluation for ACM TRETS Papers Submitted from the FPT Journal Track

IF 2.3, Zone 4 (Computer Science), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-06-21 DOI: https://dl.acm.org/doi/10.1145/3596513

Miriam Leeser

Authors of papers that were accepted to ACM TRETS via the FPT 2022 journal track had the option of participating in Artifact Evaluation (AE). Four papers from this track volunteered to participate in the AE process. All of these papers have been awarded badges from ACM as described below.

Citations: 0

ZyPR: End-to-end Build Tool and Runtime Manager for Partial Reconfiguration of FPGA SoCs at the Edge

IF 2.3, Zone 4 (Computer Science), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-06-21 DOI: https://dl.acm.org/doi/10.1145/3585521

Alex R. Bucknall, Suhaib A. Fahmy

Partial reconfiguration (PR) is a key enabler of the design and development of adaptive systems on modern Field Programmable Gate Array (FPGA) Systems-on-Chip (SoCs), allowing hardware to be adapted dynamically at runtime. Vendor-supported PR infrastructure is performance-limited and blocking, drivers entail complex memory management, and software/hardware design requires bespoke knowledge of the underlying hardware. This article presents ZyPR: a complete end-to-end framework that provides high-performance reconfiguration of hardware from within a software abstraction in the Linux userspace, automating the process of building PR applications with support for the Xilinx Zynq and Zynq UltraScale+ architectures, aimed at enabling non-expert application designers to leverage PR for edge applications. We compare ZyPR against traditional vendor tooling for PR management as well as recent open source tools that support PR under Linux. The framework provides a high-performance runtime along with low overhead for its provided abstractions. We introduce improvements to our previous work, increasing the provisioning throughput for PR bitstreams on the Zynq UltraScale+ by 2× and 5.4× compared to Xilinx’s FPGA Manager.

Citations: 0

FPGA Implementation of Compact Hardware Accelerators for Ring-Binary-LWE-based Post-quantum Cryptography

IF 2.3, Zone 4 (Computer Science), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-06-21 DOI: https://dl.acm.org/doi/10.1145/3569457

Pengzhou He, Tianyou Bao, Jiafeng Xie, Moeness Amin

Post-quantum cryptography (PQC) has recently drawn substantial attention from various communities owing to the proven vulnerability of existing public-key cryptosystems against the attacks launched from well-established quantum computers. The Ring-Binary-Learning-with-Errors (RBLWE), a variant of Ring-LWE, has been proposed to build PQC for lightweight applications. As more Field-Programmable Gate Array (FPGA) devices are being deployed in lightweight applications like Internet-of-Things (IoT) devices, it would be interesting if the RBLWE-based PQC can be implemented on the FPGA with ultra-low complexity and flexible processing. However, thus far, limited information is available for such implementations. In this article, we propose novel RBLWE-based PQC accelerators on the FPGA with ultra-low implementation complexity and flexible timing. We first present the process of deriving the key operation of the RBLWE-based scheme into the proposed algorithmic operation. The corresponding hardware accelerator is then efficiently mapped from the proposed algorithm with the help of algorithm-to-architecture implementation techniques and extended to obtain higher-throughput designs. The final complexity analysis and implementation results (on a variety of FPGAs) show that the proposed accelerators have significantly smaller area-time complexities than the state-of-the-art designs. Overall, the proposed accelerators feature low implementation complexity and flexible processing, making them desirable for emerging FPGA-based lightweight applications.
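
The abstract does not spell out the key operation it refers to. As a hedged sketch, assume it is polynomial multiplication in the ring Z_q[x]/(x^n + 1) with one operand restricted to 0/1 coefficients, which is the usual arithmetic core of Ring-LWE-style schemes. Under that assumption the multiplication reduces to additions and subtractions of shifted copies, as the toy example below shows. The function names and the tiny parameters (n = 4, q = 16) are illustrative only and are not taken from the paper.

```python
def negacyclic_mul_binary(a, b, q):
    """Compute a(x) * b(x) mod (x^n + 1, q), where b has only 0/1 coefficients."""
    n = len(a)
    result = [0] * n
    for i, bit in enumerate(b):
        if bit == 0:
            continue  # zero coefficients contribute nothing
        for j, coeff in enumerate(a):
            k = i + j
            if k < n:
                result[k] = (result[k] + coeff) % q
            else:
                # x^n = -1 in this ring, so wrapped terms are subtracted
                result[k - n] = (result[k - n] - coeff) % q
    return result

def negacyclic_mul_schoolbook(a, b, q):
    """Reference implementation: general multiplication in the same ring."""
    n = len(a)
    result = [0] * n
    for i in range(n):
        for j in range(n):
            sign = 1 if i + j < n else -1
            idx = (i + j) % n
            result[idx] = (result[idx] + sign * a[i] * b[j]) % q
    return result

a = [3, 1, 4, 1]   # coefficients of 3 + x + 4x^2 + x^3
b = [1, 0, 1, 1]   # binary polynomial 1 + x^2 + x^3
q = 16
product = negacyclic_mul_binary(a, b, q)
```

Because the binary operand removes every coefficient multiplication, the inner loop needs only an adder and a subtractor, which is one reason such schemes are attractive for small hardware.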

Citations: 0

A Survey of Processing Systems for Phylogenetics and Population Genetics

IF 2.3, Zone 4 (Computer Science), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-06-20 DOI: https://dl.acm.org/doi/10.1145/3588033

Reinout Corts, Nikolaos Alachiotis

The COVID-19 pandemic brought Bioinformatics into the spotlight, revealing that several existing methods, algorithms, and tools were not well prepared to handle large amounts of genomic data efficiently. This led to prohibitively long execution times and the need to reduce the extent of analyses to obtain results in a reasonable amount of time. In this survey, we review available high-performance computing and hardware-accelerated systems based on FPGA and GPU technology. Optimized and hardware-accelerated systems can conduct more thorough analyses considerably faster than pure software implementations, allowing researchers to reach important conclusions in a timely manner and drive scientific discoveries. We discuss the reasons that are currently hindering high-performance solutions from being widely deployed in real-world biological analyses and describe a research direction that can pave the way to enable this.

Citations: 0

ADAS: A High Computational Utilization Dynamic Reconfigurable Hardware Accelerator for Super Resolution

IF 2.3, Zone 4 (Computer Science), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-06-20 DOI: https://dl.acm.org/doi/10.1145/3570927

Liang Chang, Xin Zhao, Jun Zhou

Super-resolution (SR) based on deep learning has obtained superior performance in image reconstruction. Recently, various algorithmic efforts have been devoted to improving image reconstruction quality and speed. However, SR inference involves huge amounts of computation and data access, leading to low hardware implementation efficiency. For instance, up-sampling with the deconvolution process requires considerable computation resources. In addition, the output feature maps of several middle layers are extraordinarily large, which makes them challenging to optimize and causes serious data access issues. In this work, we present an all-on-chip hardware architecture based on the deconvolution scheme and feature map segmentation strategy, namely ADAS, where all the data generated by the middle layers are buffered on-chip to avoid large data movements between on- and off-chip memory. In ADAS, we develop a hardware-friendly and efficient deconvolution scheme to accelerate the computation. Also, a dynamically reconfigurable processing element (PE) combined with efficient mapping is proposed to enhance PE utilization up to nearly 100% and support multiple scaling factors. Based on our experimental results, ADAS demonstrates real-time image SR and better image reconstruction quality with PSNR (37.15 dB) and SSIM (0.9587). Compared to baseline and validated with the FPGA platform, ADAS can support scaling factors of 2, 3, and 4, achieving 2.68×, 5.02×, and 8.28× speedup.
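
For readers unfamiliar with the PSNR figure quoted in this abstract, the sketch below computes peak signal-to-noise ratio from its standard definition, 10 · log10(MAX² / MSE). The pixel values are made up for illustration and the 8-bit peak value of 255 is an assumption; nothing here reproduces the paper's evaluation setup.

```python
import math

def psnr(reference, reconstructed, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(reconstructed):
        raise ValueError("images must have the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return math.inf  # identical images have unbounded PSNR
    return 10 * math.log10(max_value ** 2 / mse)

# Hypothetical 4-pixel images, flattened to lists.
reference = [52, 55, 61, 59]
reconstructed = [54, 55, 60, 58]
value_db = psnr(reference, reconstructed)
```

Higher values indicate a reconstruction closer to the reference, so a PSNR near 37 dB corresponds to a small mean squared error relative to the peak pixel value.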

Citations: 0

High-performance and Configurable SW/HW Co-design of Post-quantum Signature CRYSTALS-Dilithium

IF 2.3, Zone 4 (Computer Science), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-06-20 DOI: https://dl.acm.org/doi/10.1145/3569456

Gaoyu Mao, Donglong Chen, Guangyan Li, Wangchen Dai, Abdurrashid Ibrahim Sanka, Çetin Kaya Koç, Ray C. C. Cheung

CRYSTALS-Dilithium is a lattice-based post-quantum digital signature scheme that is resistant to attacks by quantum computers and has been selected to be standardized in the NIST post-quantum cryptography (PQC) standardization process. However, the speed performance and design flexibility of the Dilithium still need to be evaluated. This article presents a high-performance software/hardware co-design of CRYSTALS-Dilithium based on the NIST PQC round-3 parameters. High-speed pipelined hardware modules for NTT/INTT, point-wise multiplication/addition, and for SHAKE are included in the design to accelerate the time-consuming operations in Dilithium. All hardware modules are parameterized, thus allowing full support of runtime configuration to increase versatility. Moreover, the proposed software/hardware architecture and tight operating workflows reduce the data transmission overhead between the processor and other hardware modules. The hardware accelerator is implemented with a reconfigurable logic on FPGA and is integrated with the high-performance ARM Cortex-A9 processor in the Xilinx Zynq Architecture. We measure the performance of the software/hardware system for Dilithium in NIST security levels 2, 3, and 5. Compared to pure software implementations, we achieve 8.7–12.5 times speedup in Key generation, 6.3–7.3 times speedup in Sign, and 9.1–12.2 times speedup in Verify operations.
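
The NTT/INTT and point-wise multiplication modules mentioned in this abstract rely on the fact that polynomial multiplication becomes element-wise multiplication after a number-theoretic transform. The toy sketch below shows that idea with deliberately tiny parameters and a naive O(N²) transform. It is not the paper's pipelined design, and for simplicity it works in the cyclic ring Z_Q[x]/(x^N - 1), whereas Dilithium uses the negacyclic ring with q = 8380417 and n = 256.

```python
Q = 17      # toy prime modulus (Dilithium itself uses q = 8380417)
N = 4       # toy transform length (Dilithium uses n = 256)
OMEGA = 4   # primitive N-th root of unity mod Q: 4**4 % 17 == 1, 4**2 % 17 != 1

def ntt(values, root=OMEGA):
    """Naive O(N^2) number-theoretic transform modulo Q."""
    return [
        sum(v * pow(root, i * j, Q) for j, v in enumerate(values)) % Q
        for i in range(N)
    ]

def intt(values):
    """Inverse transform: use the inverse root, then scale by N^-1 mod Q."""
    inv_root = pow(OMEGA, -1, Q)   # modular inverse, Python 3.8+
    inv_n = pow(N, -1, Q)
    return [(x * inv_n) % Q for x in ntt(values, inv_root)]

def cyclic_convolution(a, b):
    """Multiply in Z_Q[x]/(x^N - 1) via point-wise products in the NTT domain."""
    fa, fb = ntt(a), ntt(b)
    return intt([(x * y) % Q for x, y in zip(fa, fb)])

a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
c = cyclic_convolution(a, b)
```

A hardware implementation would replace the naive transform with butterfly stages, but the forward transform, point-wise product, and inverse transform keep the same roles.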

Citations: 0