
Latest Publications in ACM Transactions on Reconfigurable Technology and Systems

High-Efficiency Compressor Trees for Latest AMD FPGAs
IF 2.3 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-02-10 | DOI: 10.1145/3645097
Konstantin J. Hoßfeld, Hans Jakob Damsgaard, Jari Nurmi, Michaela Blott, Thomas B. Preußer

High-fan-in dot product computations are ubiquitous in highly relevant application domains such as signal processing and machine learning. In particular, the diverse set of data formats used in machine learning poses a challenge for flexible, efficient design solutions. Ideally, a dot product summation is composed of a carry-free compressor tree followed by a terminal carry-propagate addition. On FPGAs, these compressor trees are constructed from generalized parallel counters (GPCs) whose architecture is closely tied to the underlying reconfigurable fabric. This work reviews known counter designs and proposes new ones in the context of the new AMD Versal™ fabric. On this basis, we develop a compressor generator featuring variable-sized counters, novel counter composition heuristics, explicit clustering strategies, and case-specific optimizations like logic gate absorption. In comparison to the Vivado™ default implementation, the combination of such a compressor with a novel, highly efficient quaternary adder reduces the LUT footprint across different bit matrix input shapes by 45% for a plain summation and by 46% for a terminal accumulation, at a slight cost in critical path delay that still allows operation well above 500 MHz. We demonstrate the aptness of our solution with examples of low-precision integer dot product accumulation units.
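To make the core idea concrete: a compressor tree reduces many addends to two in carry-free stages, and only the final pair passes through a carry-propagate adder. The Python sketch below models this with plain 3:2 counters (full adders); it is an illustration only, and the GPC shapes, composition heuristics, and fabric mapping that constitute the paper's contribution are not modeled here.

def compress_3_to_2(rows):
    # One carry-free stage: every group of three rows is replaced by two
    # (bitwise sum and shifted carry), using the full-adder identity
    # a + b + c = (a ^ b ^ c) + 2 * majority(a, b, c).
    out = []
    while len(rows) >= 3:
        a, b, c, rows = rows[0], rows[1], rows[2], rows[3:]
        out.append(a ^ b ^ c)                           # sum bits, same weight
        out.append(((a & b) | (a & c) | (b & c)) << 1)  # carry bits, next weight
    return out + rows

def compressor_tree_sum(operands):
    rows = list(operands)
    while len(rows) > 2:              # carry-free compressor tree
        rows = compress_3_to_2(rows)
    return sum(rows)                  # terminal carry-propagate addition

assert compressor_tree_sum([13, 7, 255, 42, 1, 99]) == sum([13, 7, 255, 42, 1, 99])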

Citations: 0
An All-Digital Compute-In-Memory FPGA Architecture for Deep Learning Acceleration
IF 2.3 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-01-15 | DOI: 10.1145/3640469
Yonggen Li, Xin Li, Haibin Shen, Jicong Fan, Yanfeng Xu, Kejie Huang

Field Programmable Gate Arrays (FPGAs) are versatile and programmable hardware platforms, which makes them promising candidates for accelerating Deep Neural Networks (DNNs). However, FPGAs' computing energy efficiency is low because interconnect data movement dominates energy consumption. In this paper, we propose an all-digital Compute-In-Memory (CIM) FPGA architecture for deep learning acceleration. Furthermore, we present a bit-serial computing circuit for the digital CIM core that accelerates vector-matrix multiplication (VMM) operations. A Network-CIM-Deployer (NCIMD) is also developed to support automatic deployment and mapping of DNN networks. NCIMD provides a user-friendly API for DNN models in Caffe format. Meanwhile, we introduce a Weight-Stationary (WS) dataflow and describe the method of mapping a single layer of the network to the CIM array in the architecture. We conduct experimental tests of the proposed FPGA architecture in the field of Deep Learning (DL), as well as in non-DL fields, using different architectural layouts and mapping strategies, and compare the results with a conventional FPGA architecture. The experimental results show that, compared to the conventional FPGA architecture, energy efficiency improves by up to 16.1× while latency decreases by up to 40% in our proposed CIM FPGA architecture.
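For readers unfamiliar with bit-serial arithmetic, this hedged sketch shows the idea behind a bit-serial VMM circuit: inputs are consumed one bit-plane per cycle, so each cycle needs only binary multiply-accumulate work, and a shift-accumulate recovers the full-precision result. The function name and parameters here are illustrative, not the paper's circuit.

import numpy as np

def bit_serial_vmm(x, W, bits=8):
    # x: length-n vector of unsigned ints below 2**bits; W: n-by-m weight matrix.
    acc = np.zeros(W.shape[1], dtype=np.int64)
    for b in range(bits):          # one bit-plane per "cycle"
        xb = (x >> b) & 1          # current bit of every input element
        acc += (xb @ W) << b       # binary VMM, weighted by 2**b
    return acc

x = np.array([3, 5, 250], dtype=np.int64)
W = np.arange(9, dtype=np.int64).reshape(3, 3)
assert np.array_equal(bit_serial_vmm(x, W), x @ W)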

Citations: 0
Evaluating the Impact of Using Multiple-Metal Layers on the Layout Area of Switch Blocks for Tile-Based FPGAs in FinFET 7nm
IF 2.3 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-01-02 | DOI: 10.1145/3639055
Sajjad Rostami Sani, Andy Ye

A new area model for estimating the layout area of switch blocks is introduced in this work. The model is based on a realistic layout strategy. As a result, it takes into consideration not only the active area needed to construct a switch block but also the number of metal layers available and the actual dimensions of these layers. The model assigns metal layers to the routing tracks in a way that reduces the number of vias needed to connect different routing tracks together while maintaining the tile-based structure of FPGAs. It also takes into account the wiring area required for buffer insertion on long wire segments. The model is evaluated based on layouts constructed in the ASAP7 FinFET 7nm Predictive Design Kit. We found that the new model, while specific to the layout strategy that it employs, improves upon traditional active-based area estimation models by considering the growth of the metal area independently from the growth of the active area. As a result, the new model is able to estimate layout area more accurately by predicting when metal area will overtake active area as the number of routing tracks is increased. This ability allows more accurate estimation of the true layout cost of FPGA fabrics at the early floor-planning and architectural-exploration stage, and this increase in accuracy can encourage wider use of custom FPGA fabrics that target specific sets of benchmarks in future SOC designs. Furthermore, our data indicate that the conclusions drawn from several significant prior architectural studies remain correct under FinFET geometries and wiring-area considerations despite their exclusive use of active-only area models. This correctness is due to the small channel widths, around 30-60 tracks per channel, of the architectures that these studies investigate. For architectures that approach the channel width of modern commercial FPGAs, with over one hundred to two hundred tracks per channel, our data show that wiring area models justified by detailed layout considerations are an essential addition to active area models for correctly predicting the implementation area of FPGAs.
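The crossover behavior the model captures can be illustrated with a deliberately crude calculation: active area grows linearly with the number of switches, while metal area grows roughly quadratically with the tracks that must fit per layer. Every constant in the sketch below is invented for illustration; the paper's model is derived from actual ASAP7 layouts.

def switch_block_area(tracks, metal_layers, fs=3,
                      active_per_switch=0.02,       # um^2 per switch (assumed)
                      wire_pitch=0.1):              # um per track (assumed)
    active = fs * tracks * active_per_switch        # transistor-limited area
    tracks_per_layer = -(-tracks // metal_layers)   # ceiling division
    metal = (tracks_per_layer * wire_pitch) ** 2    # wiring-limited area
    return active, metal

for w in (30, 60, 120, 240):
    active, metal = switch_block_area(w, metal_layers=4)
    print(w, "active-limited" if active >= metal else "metal-limited")

With these toy constants the tile flips from active-limited to metal-limited between 60 and 120 tracks, mirroring the abstract's finding that active-only models hold at 30-60 tracks per channel but not at modern 100-200-track channel widths.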

Citations: 0
CSAIL2019 Crypto-Puzzle Solver Architecture
IF 2.3 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-12-29 | DOI: 10.1145/3639056
Sergey Gribok, Bogdan Pasca, Martin Langhammer

The CSAIL2019 time-lock puzzle is an unsolved cryptographic challenge introduced by Ron Rivest in 2019, replacing the solved LCS35 puzzle. Solving these puzzles requires a large amount of intrinsically sequential computation, with each iteration performing a very large (3072-bit for CSAIL2019) modular multiplication. The complexity of each iteration is several times greater than that of known FPGA implementations, and the number of iterations has been increased by about 1000× compared to LCS35. Because of the high complexity of this new puzzle, a number of intermediate, or milestone, versions of the puzzle have been specified. In this article, we present several FPGA architectures for the CSAIL2019 solver, which we implement on a medium-sized Intel Agilex device. We develop a new multi-cycle modular multiplication method, which is flexible and fits a wide range of current FPGA sizes. We introduce a class of multi-cycle squarer-based architectures that allow for better resource and area trade-offs. We also demonstrate a new approach for improving the fitting and timing closure of large, chip-filling arithmetic designs. We used the solver to compute the first 22 of the puzzle's 28 milestone solutions, which are the first reported results for this problem.
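The computational core of such puzzles, and the reason a single fast modular squarer is the whole game, is a chain of t sequential squarings that cannot be parallelized without knowing the factorization of the modulus. A minimal software model with toy parameters follows; CSAIL2019 itself uses a 3072-bit modulus and, per the abstract, roughly 1000× the iteration count of LCS35.

def timelock_solve(x, t, N):
    # t sequential modular squarings: each step depends on the previous one,
    # so the only speedup available is a faster single multiplication.
    for _ in range(t):
        x = (x * x) % N
    return x

N = 1000000007 * 998244353          # toy modulus, NOT 3072 bits
print(timelock_solve(2, 10**6, N))  # computes 2**(2**t) mod N for t = 10**6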

Citations: 0
Introduction to the Special Section on FPGA 2022
IF 2.3 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-12-13 | DOI: 10.1145/3618114
P. Ienne
{"title":"Introduction to the Special Section on FPGA 2022","authors":"P. Ienne","doi":"10.1145/3618114","DOIUrl":"https://doi.org/10.1145/3618114","url":null,"abstract":"","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"15 6","pages":"1 - 2"},"PeriodicalIF":2.3,"publicationDate":"2023-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139004736","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
AEKA: FPGA Implementation of Area-Efficient Karatsuba Accelerator for Ring-Binary-LWE-based Lightweight PQC
IF 2.3 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-12-11 | DOI: 10.1145/3637215
Tianyou Bao, Pengzhou He, Jiafeng Xie, H S. Jacinto

Lightweight PQC-related research and development have gradually gained attention from the research community. The Ring-Binary-Learning-with-Errors (RBLWE)-based encryption scheme (RBLWE-ENC) is a promising lightweight PQC built on small parameter sets to fit related applications, though these parameters do not favor deploying popular fast algorithms like the number theoretic transform. To address this problem, we present a novel hardware-accelerated implementation of RBLWE-ENC based on the Karatsuba algorithm, targeting the field-programmable gate array (FPGA) platform. In detail, we propose an area-efficient Karatsuba Accelerator (AEKA) for RBLWE-ENC based on three layers of innovative effort. First, we reformulate the signal-processing sequence within the major arithmetic component of the Karatsuba-based polynomial multiplication for RBLWE-ENC to obtain a new algorithm. Then, we design the proposed algorithm into a new hardware accelerator with several novel algorithm-to-architecture mapping techniques. Finally, we conduct a thorough complexity analysis and comparison to demonstrate the efficiency of the proposed accelerator: for example, it achieves 62.5% higher throughput and 60.2% lower area-delay product (ADP) than the state-of-the-art design for n = 512 (Virtex-7 device, similar setup). The proposed AEKA design strategy is highly efficient on FPGA devices, i.e., small resource usage with superior timing, and can be integrated with other systems for lightweight-oriented high-performance applications (e.g., servers). The outcome of this work is also expected to generate impact for lightweight PQC advancement.
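For background, one level of the Karatsuba split trades four half-size polynomial products for three at the cost of a few additions, which is where the area saving originates. The Python sketch below shows only this algorithmic skeleton; the ring reduction used by RBLWE-ENC and the paper's algorithm-to-architecture mapping are out of scope, and the function names are ours.

def poly_mul_schoolbook(a, b):
    res = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            res[i + j] += ai * bj
    return res

def _add(a, b):
    n = max(len(a), len(b))
    return [(a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0)
            for i in range(n)]

def poly_mul_karatsuba(a, b):
    # Assumes len(a) == len(b) == n, with n a power of two.
    n = len(a)
    if n <= 2:
        return poly_mul_schoolbook(a, b)
    h = n // 2
    p0 = poly_mul_karatsuba(a[:h], b[:h])    # low halves
    p2 = poly_mul_karatsuba(a[h:], b[h:])    # high halves
    pm = poly_mul_karatsuba(_add(a[:h], a[h:]), _add(b[:h], b[h:]))
    mid = [m - lo - hi for m, lo, hi in zip(pm, p0, p2)]  # pm - p0 - p2
    res = [0] * (2 * n - 1)
    for i, c in enumerate(p0):  res[i] += c
    for i, c in enumerate(mid): res[i + h] += c
    for i, c in enumerate(p2):  res[i + 2 * h] += c
    return res

a, b = [1, 2, 3, 4], [5, 6, 7, 8]
assert poly_mul_karatsuba(a, b) == poly_mul_schoolbook(a, b)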

Citations: 0
Design, Calibration, and Evaluation of Real-Time Waveform Matching on an FPGA-based Digitizer at 10 GS/s
IF 2.3 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-12-05 | DOI: 10.1145/3635719
Jens Trautmann, Paul Krüger, Andreas Becher, Stefan Wildermann, Jürgen Teich

Digitizing side-channel signals at high sampling rates produces huge amounts of data, while side-channel analysis techniques only need the specific trace segments containing Cryptographic Operations (COs). For detecting these segments, waveform-matching techniques have been established that compare the signal with a template of the CO's characteristic pattern. Real-time waveform matching requires highly parallel implementations, as achieved by hardware design, but also reconfigurability, as provided by FPGAs, to adapt the matching hardware to a specific CO pattern. However, currently proposed designs process the samples from analog-to-digital converters sequentially and can only handle low sampling rates due to the limited clock speed of FPGAs.

In this paper, we present a parallel waveform-matching architecture capable of performing high-speed waveform matching on a high-end FPGA-based digitizer. We also present a workflow for calibrating the waveform-matching system to the specific pattern of the CO under the hardware restrictions imposed by the FPGA. Our implementation enables waveform matching at 10 GS/s, offering a speedup of 50× compared to the fastest state-of-the-art implementation known to us. We demonstrate how to apply the technique to attack the widespread XTS-AES algorithm, using waveform matching to recover the encrypted tweak even in the presence of so-called systemic noise.
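To make the matching step concrete, the sketch below flags trace windows whose distance to the CO template falls below a calibrated threshold, with a plain sum of absolute differences standing in for whatever metric the hardware uses; the paper's achievement is evaluating every window in parallel at 10 GS/s, which this sequential sketch does not attempt.

import numpy as np

def match_positions(trace, template, threshold):
    m = len(template)
    hits = []
    for i in range(len(trace) - m + 1):      # in hardware: all offsets in parallel
        sad = np.abs(trace[i:i + m] - template).sum()
        if sad < threshold:
            hits.append(i)                   # likely start of a CO
    return hits

rng = np.random.default_rng(0)
template = rng.normal(size=64)
trace = rng.normal(size=4096)
trace[1000:1064] = template + rng.normal(scale=0.05, size=64)   # embed a CO
print(match_positions(trace, template, threshold=10.0))          # expected: [1000]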

Citations: 0
Introduction to the Special Section on FCCM 2022
IF 2.3 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-12-05 | DOI: 10.1145/3632092
Jing Li, Martin Herbordt

No abstract available.

Citations: 0
A hardware design framework for computer vision models based on reconfigurable devices
IF 2.3 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-12-05 | DOI: 10.1145/3635157
Zimeng Fan, Wei Hu, Fang Liu, Dian Xu, Hong Guo, Yanxiang He, Min Peng

In computer vision, algorithms and the computing platforms they run on must be developed jointly. Models and algorithms are constantly evolving, while hardware designs must adapt to new or updated algorithms. Reconfigurable devices are recognized as important platforms for computer vision applications because of their reconfigurability. There are two typical design approaches: customized design and overlay design. However, existing work is unable to achieve both efficient performance and the scalability to adapt to a wide range of models. To address both considerations, we propose a design framework based on reconfigurable devices that provides unified support for computer vision models. It provides software-programmable modules while leaving unit design space for problem-specific algorithms. Based on the proposed framework, we design a model-mapping method and a hardware architecture with two processor arrays to enable dynamic and static reconfiguration, thereby relieving redesign pressure. In addition, resource consumption and efficiency can be balanced by adjusting the hyperparameter. In experiments on CNN, vision Transformer, and vision MLP models, our work's throughput is improved by 18.8x–33.6x and 1.4x–2.0x compared to CPU and GPU, respectively. Compared to other designs on the same platform, accelerators based on our framework can better balance resource consumption and efficiency.

Citations: 0
High Throughput FPGA-Based Object Detection via Algorithm-Hardware Co-Design
IF 2.3 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-12-04 | DOI: 10.1145/3634919
Anupreetham Anupreetham, Mohamed Ibrahim, Mathew Hall, Andrew Boutros, Ajay Kuzhively, Abinash Mohanty, Eriko Nurvitadhi, Vaughn Betz, Yu Cao, Jae-sun Seo

Object detection and classification is a key task in many computer vision applications such as smart surveillance and autonomous vehicles. Recent advances in deep learning have significantly improved the quality of results achieved by these systems, making them more accurate and reliable in complex environments. Modern object detection systems make use of lightweight convolutional neural networks (CNNs) for feature extraction, coupled with single-shot multi-box detectors (SSDs) that generate bounding boxes around the identified objects along with their classification confidence scores. Subsequently, a non-maximum suppression (NMS) module removes any redundant detection boxes from the final output. Typical NMS algorithms must wait for all box predictions to be generated by the SSD-based feature extractor before processing them. This sequential dependency between box predictions and NMS results in a significant latency overhead and degrades the overall system throughput, even if a high-performance CNN accelerator is used for the SSD feature extraction component. In this paper, we present a novel pipelined NMS algorithm that eliminates this sequential dependency and associated NMS latency overhead. We then use our novel NMS algorithm to implement an end-to-end fully pipelined FPGA system for low-latency SSD-MobileNet-V1 object detection. Our system, implemented on an Intel Stratix 10 FPGA, runs at 400 MHz and achieves a throughput of 2,167 frames per second with an end-to-end batch-1 latency of 2.13 ms. Our system achieves 5.3× higher throughput and 5× lower latency compared to the best prior FPGA-based solution with comparable accuracy.
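For reference, the conventional greedy NMS that the pipelined variant improves on looks like the sketch below; note that it cannot begin until every box prediction is available, which is exactly the sequential dependency the paper removes. Box coordinates and thresholds here are illustrative.

def iou(a, b):
    # Boxes as (x1, y1, x2, y2); intersection-over-union of two boxes.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def nms(boxes, scores, iou_thresh=0.5):
    # Greedy suppression: keep the best-scoring box, drop heavy overlaps, repeat.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best, order = order[0], order[1:]
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))   # -> [0, 2]: the near-duplicate box is suppressed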

Citations: 0