ACM Transactions on Reconfigurable Technology and Systems: Latest Articles (Page 2)

HLPerf: Demystifying the Performance of HLS-based Graph Neural Networks with Dataflow Architectures

IF 2.3 · Zone 4, Computer Science · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2024-04-02 DOI: 10.1145/3655627

Chenfeng Zhao, Clayton J. Faber, Roger D. Chamberlain, Xuan Zhang

Developing FPGA-based applications with high-level synthesis (HLS) is fraught with performance pitfalls and long design space exploration times. These issues are exacerbated when the application is complex and its performance depends on the input data set, as is often the case with graph neural network approaches to machine learning. Here, we introduce HLPerf, an open-source, simulation-based performance evaluation framework for dataflow architectures that both supports early exploration of the design space and shortens the performance evaluation cycle. We apply the methodology to GNNHLS, an HLS-based graph neural network benchmark containing 6 commonly used graph neural network models and 4 datasets with distinct topologies and scales. The results show that HLPerf achieves over 10,000× average simulation acceleration relative to RTL simulation and over 400× acceleration relative to state-of-the-art cycle-accurate tools, at the cost of a 7% mean error rate relative to actual FPGA implementation performance. This acceleration positions HLPerf as a viable component in the design cycle.
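To make the idea of simulating a dataflow design with data-dependent timing concrete, here is a toy sketch. It is not HLPerf's model: the function name `simulate_dataflow`, the unbounded-FIFO assumption, and the per-stage cost functions are all hypothetical choices made for illustration. Each stage processes items in order, and an item starts in a stage only once it has left the previous stage and the stage itself is free.

```python
from typing import Callable, List, Sequence


def simulate_dataflow(items: Sequence[int],
                      stage_costs: List[Callable[[int], int]]) -> int:
    """Return the cycle at which the last item leaves the last stage.

    Assumes (hypothetically) in-order stages connected by unbounded FIFOs.
    stage_costs[s](item) gives the cycles stage s spends on that item.
    """
    if not items:
        return 0
    ready = [0] * len(items)  # cycle at which each item is available to the stage
    for cost in stage_costs:
        free_at = 0           # cycle at which this stage becomes idle
        done = []
        for item_ready, item in zip(ready, items):
            start = max(item_ready, free_at)
            free_at = start + cost(item)
            done.append(free_at)
        ready = done
    return ready[-1]


# Items stand for node degrees: the gather stage cost grows with degree,
# while the update stage has a fixed cost of 2 cycles.
degrees = [1, 4, 2]
total_cycles = simulate_dataflow(degrees, [lambda d: d, lambda d: 2])
print(total_cycles)
```

Because the gather cost depends on the degree of each node, the same design yields different cycle counts on graphs with different topologies, which is the kind of input dependence the abstract describes.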

Citations: 0

PTME: A Regular Expression Matching Engine Based on Speculation and Enumerative Computation on FPGA

Pub Date : 2024-04-01 DOI: 10.1145/3655626

Mingqian Sun, Guangwei Xie, Fan Zhang, Wei Guo, Xitian Fan, Tianyang Li, Li Chen, Jiayu Du

Fast regular expression matching is an essential task for deep packet inspection. In previous work, regular expression matching engines on FPGAs have struggled to balance resource consumption against throughput. Speculation and enumerative computation exploit the statistical properties of deterministic finite automata (DFAs), allowing for more efficient pattern matching. Existing designs of this kind mostly rely on vector instructions, multiple processors/cores, or SIMD instruction sets, and implementations on FPGA platforms are lacking. We design a parallelized two-character matching engine on FPGA that quickly filters out fields with no pattern features. We transform state transitions with sequential dependencies into the problem of testing whether elements exist in a set, enabling the proposed design to achieve high throughput with low resource consumption and to support dynamic updates. Results show that, compared with traditional DFA matching and with a maximum resource consumption of 25% for on-chip FFs (74323/1045440) and LUTs (123902/522720), throughput improves by 8.08-229.96× (an 87.61-99.56% percentage improvement) for normal traffic and by 11.73-39.59× (a 91.47-97.47% percentage improvement) for traffic with high-frequency match hits. Compared with the state-of-the-art similar implementation, our circuit on a single FPGA chip is superior to existing multi-core designs.
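The general idea behind enumerative computation on a DFA can be sketched in a few lines. This is a software illustration of the textbook technique, not PTME's hardware design; the helper names and the tiny example automaton are assumptions made here. The input is cut into chunks, each chunk is run from every possible start state (work that is independent per chunk and could run in parallel), and the resulting state-to-state maps are then chained together.

```python
from typing import Dict

Delta = Dict[int, Dict[str, int]]  # state -> input character -> next state


def run_dfa(delta: Delta, state: int, text: str) -> int:
    """Plain sequential DFA execution."""
    for ch in text:
        state = delta[state][ch]
    return state


def enumerative_run(delta: Delta, start: int, text: str, chunks: int) -> int:
    """Run the DFA chunk by chunk, enumerating all start states per chunk."""
    size = max(1, -(-len(text) // chunks))  # ceiling division
    pieces = [text[i:i + size] for i in range(0, len(text), size)]
    # Each map is independent of the others, so this step is parallelizable.
    maps = [{s: run_dfa(delta, s, piece) for s in delta} for piece in pieces]
    state = start
    for mapping in maps:  # cheap sequential composition of the maps
        state = mapping[state]
    return state


# Example DFA over {a, b}: state 2 is reached once "ab" has been seen.
delta = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 2, "b": 2},
}
print(enumerative_run(delta, 0, "bbabab", chunks=3))
```

Speculation reduces the enumeration cost further by guessing the likely start state of each chunk and recomputing only when the guess turns out to be wrong; that refinement is omitted from this sketch.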

Citations: 0

Design and implementation of hardware-software architecture based on hashes for SPHINCS+

Pub Date : 2024-03-27 DOI: 10.1145/3653459

Jonathan López-Valdivieso, René Cumplido

Advances in quantum computing pose a future threat to today’s cryptography: with the advent of quantum computers, security could be compromised. Therefore, the National Institute of Standards and Technology (NIST) has issued a request for proposals to standardize algorithms for post-quantum cryptography (PQC), which rely on problems considered difficult to solve for both classical and quantum computers. Among the proposed technologies, the most popular choices are lattice-based (shortest vector problem) and hash-based approaches. Other important categories are public-key encryption (PKE) and digital signatures.

Within the realm of digital signatures lies SPHINCS+. However, there are few implementations of this scheme in hardware architectures. In this article, we present a hardware-software architecture for the SPHINCS+ scheme. We utilized a free RISC-V (Reduced Instruction Set Computer) processor synthesized on a Field Programmable Gate Array (FPGA), primarily integrating two accelerator modules for Keccak-1600 and the Haraka hash function. Additionally, modifications were made to the processor to accommodate the execution of these added modules. Our implementation yielded a 15-fold increase in performance with the SHAKE-256 function and a nearly 90-fold improvement when using Haraka, compared to the reference software. Moreover, it is more compact than related works. This implementation was realized on a Xilinx FPGA Arty S7: Spartan-7.
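To illustrate why hash throughput dominates hash-based signatures, the sketch below builds a hash chain with SHAKE-256 from Python's standard `hashlib`. It is a deliberately simplified illustration: the real SPHINCS+ hash calls also take a public seed and a structured address, and the helper names `shake256` and `chain`, the 16-byte output length, and the 4-byte step counter are assumptions made here.

```python
import hashlib


def shake256(data: bytes, out_len: int) -> bytes:
    """SHAKE-256 extendable-output function from the standard library."""
    return hashlib.shake_256(data).digest(out_len)


def chain(value: bytes, start: int, steps: int, n: int = 16) -> bytes:
    """Apply the hash `steps` times, mixing in the step index each time.

    Simplified stand-in for a one-time-signature hash chain; it is not the
    actual SPHINCS+ tweakable hash.
    """
    for i in range(start, start + steps):
        value = shake256(i.to_bytes(4, "big") + value, n)
    return value


secret = b"\x00" * 16
end_of_chain = chain(secret, 0, 15)      # signer-side chain end
partial = chain(secret, 0, 6)            # value revealed at position 6
recomputed = chain(partial, 6, 9)        # verifier finishes the chain
print(recomputed == end_of_chain)
```

A signature involves a very large number of such hash calls, which is why moving the hash primitive into a hardware accelerator can pay off.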

Citations: 0

FADO: Floorplan-Aware Directive Optimization Based on Synthesis and Analytical Models for High-Level Synthesis Designs on Multi-Die FPGAs

Pub Date : 2024-03-20 DOI: 10.1145/3653458

Linfeng Du, Tingyuan Liang, Xiaofeng Zhou, Jinming Ge, Shangkun Li, Sharad Sinha, Jieru Zhao, Zhiyao Xie, Wei Zhang

Multi-die FPGAs are widely adopted for large-scale accelerators, but optimizing high-level synthesis designs on these FPGAs faces two challenges. First, the delay caused by die-crossing nets creates an NP-hard floorplanning problem. Second, traditional directive optimization cannot consider resource constraints on each die or the timing issue incurred by the die-crossings. Furthermore, the high algorithmic complexity and the large scale lead to extended runtime for legalizing the floorplan of HLS designs under different directive configurations.

To co-optimize the directives and floorplan of HLS designs on multi-die FPGAs, we formulate the co-search based on bin-packing variants and present two iterative optimization flows. The first (FADO 1.0) relies on a pre-built QoR library. It involves a greedy, latency-bottleneck-guided directive search and an incremental floorplan legalization. Compared with a global floorplanning solution, its search time is 693× to 4925× shorter and it achieves 1.16× to 8.78× better design performance, measured in workload execution time.

To remove the time-consuming QoR library generation, the second flow (FADO 2.0) integrates an analytical QoR model and redesigns the directive search to accelerate convergence. In experiments on mixed dataflow and non-dataflow designs, FADO 2.0 yields a further 1.40× better design performance on average than FADO 1.0 after implementation on the Alveo U250 FPGA.
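The bin-packing view of floorplanning can be illustrated with a classic first-fit-decreasing heuristic. This is a generic textbook sketch, not FADO's algorithm: it tracks a single resource type, ignores die-crossing delay entirely, and the module names and capacities are hypothetical.

```python
from typing import Dict, List, Tuple


def first_fit_decreasing(modules: Dict[str, int],
                         num_dies: int,
                         capacity: int) -> Tuple[Dict[str, int], List[int]]:
    """Assign each module to the first die with enough remaining capacity.

    modules maps a module name to its usage of one resource (e.g. LUTs).
    Returns (placement, per-die usage); raises if a module fits nowhere.
    """
    used = [0] * num_dies
    placement: Dict[str, int] = {}
    # Placing the largest modules first is the "decreasing" part.
    for name, size in sorted(modules.items(), key=lambda kv: kv[1], reverse=True):
        for die in range(num_dies):
            if used[die] + size <= capacity:
                used[die] += size
                placement[name] = die
                break
        else:
            raise ValueError(f"module {name!r} does not fit on any die")
    return placement, used


# Hypothetical resource usage in arbitrary units.
modules = {"load": 6, "compute": 5, "reduce": 4, "store": 3}
placement, used = first_fit_decreasing(modules, num_dies=2, capacity=10)
print(placement, used)
```

A change of directive (for example a larger unroll factor) changes a module's size, so a previously legal placement may overflow a die. That coupling is why the abstract treats directive search and floorplan legalization as one co-optimization problem.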

Citations: 0

Designing an IEEE-compliant FPU that supports configurable precision for soft processors

Pub Date : 2024-03-15 DOI: 10.1145/3650036

Chris Keilbart, Yuhui Gao, Martin Chua, Eric Matthews, Steven J.E. Wilton, Lesley Shannon

Field Programmable Gate Arrays (FPGAs) are commonly used to accelerate floating-point (FP) applications. Although researchers have extensively studied FPGA FP implementations, existing work has largely focused on standalone operators and frequency-optimized designs. These works are not suitable for FPGA soft processors, which are more sensitive to latency, impose a lower frequency ceiling, and require IEEE FP standard compliance. We present an open-source floating-point unit (FPU) for FPGA RISC-V soft processors that is fully IEEE compliant with configurable levels of FP precision. Our design emphasizes runtime performance, with 25% lower latency in the most common instructions compared to previous works, while maintaining efficient resource utilization.

Our FPU also allows users to explore various mantissa widths without having to rewrite or recompile their algorithms. We use this to investigate the scalability of our reduced-precision FPU across numerous microbenchmark functions as well as more complex case studies. Our experiments show that applications like the discrete cosine transform and the Black-Scholes model can realize a speedup of more than 1.35x, together with reductions of 43% and 35% in lookup table and flip-flop resources, respectively, while experiencing less than a 0.025% average loss in numerical accuracy with a 16-bit mantissa width.
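The effect of a narrower mantissa can be emulated in software by clearing low-order mantissa bits of a single-precision value. This sketch is an assumption-laden illustration, not the FPU's behavior: it truncates instead of applying IEEE rounding, it counts only the 23 explicitly stored mantissa bits, and the helper name `truncate_mantissa` is invented here.

```python
import struct


def truncate_mantissa(x: float, mantissa_bits: int) -> float:
    """Round x to float32, then keep only the top `mantissa_bits` stored bits.

    Uses truncation toward zero for simplicity; a real IEEE-compliant unit
    would apply a proper rounding mode.
    """
    if not 0 <= mantissa_bits <= 23:
        raise ValueError("float32 stores 23 explicit mantissa bits")
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    mask = (0xFFFFFFFF << (23 - mantissa_bits)) & 0xFFFFFFFF
    (result,) = struct.unpack(">f", struct.pack(">I", bits & mask))
    return result


full = truncate_mantissa(3.14159, 23)     # plain float32 value
reduced = truncate_mantissa(3.14159, 16)  # 16 stored mantissa bits
relative_error = abs(full - reduced) / full
print(reduced, relative_error)
```

With truncation, the relative error of a single value stays below 2^-16 for a 16-bit mantissa. The accuracy loss of a whole application depends on how such errors accumulate, which is what the reported case studies measure.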

Citations: 0

L-FNNG: Accelerating Large-Scale KNN Graph Construction on CPU-FPGA Heterogeneous Platform

Pub Date : 2024-03-14 DOI: 10.1145/3652609

Chaoqiang Liu, Xiaofei Liao, Long Zheng, Yu Huang, Haifeng Liu, Yi Zhang, Haiheng He, Haoyan Huang, Jingyi Zhou, Hai Jin

Due to the high complexity of constructing exact k-nearest neighbor graphs, approximate construction has become a popular research topic. The NN-Descent algorithm is one of the representative in-memory algorithms. To effectively handle large datasets, existing state-of-the-art solutions combine the divide-and-conquer approach and the NN-Descent algorithm: large datasets are divided into multiple partitions, and a subgraph is constructed for each partition before all the subgraphs are merged, reducing the memory pressure significantly. However, such solutions fail to address inefficiencies in large-scale k-nearest neighbor graph construction. In this paper, we propose L-FNNG, a novel solution for accelerating large-scale k-nearest neighbor graph construction on a CPU-FPGA heterogeneous platform. The CPU is responsible for dividing data and determining the order of partition processing, while the FPGA executes all construction tasks to fully utilize its acceleration capability. To accelerate the execution of construction tasks, we design an efficient FPGA accelerator, which includes the Block-based Scheduling (BS) and Useless Computation Aborting (UCA) techniques to address the problems of memory access and computation in the NN-Descent algorithm. We also propose an efficient scheduling strategy that includes a KD-tree-based data partitioning method and a hierarchical processing method to address scheduling inefficiency. We evaluate L-FNNG on a Xilinx Alveo U280 board hosted by a 64-core Xeon server. On multiple large-scale datasets, L-FNNG achieves, on average, a 2.3× construction speedup over the state-of-the-art GPU-based solution.
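For readers unfamiliar with NN-Descent, the following is a minimal single-threaded sketch of its core idea: a neighbor of a neighbor is likely to be a neighbor, so points that share a neighborhood are compared against each other. It omits the sampling and new/old flagging of the published algorithm as well as everything specific to L-FNNG (partitioning, BS, UCA); all names are assumptions made here.

```python
import random
from typing import Dict, List, Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def try_insert(neighbors: Dict[int, float], cand: int, dist: float, k: int) -> int:
    """Add cand to a neighbor set if it beats the current worst; return 1 on change."""
    if cand in neighbors:
        return 0
    if len(neighbors) < k:
        neighbors[cand] = dist
        return 1
    worst = max(neighbors, key=neighbors.get)
    if dist < neighbors[worst]:
        del neighbors[worst]
        neighbors[cand] = dist
        return 1
    return 0


def nn_descent(points, k: int, max_iters: int = 10, seed: int = 0) -> List[List[int]]:
    """Approximate k-NN graph; requires k < len(points)."""
    rng = random.Random(seed)
    n = len(points)
    graph = []
    for i in range(n):  # start from a random graph
        others = [j for j in range(n) if j != i]
        graph.append({j: sq_dist(points[i], points[j]) for j in rng.sample(others, k)})
    for _ in range(max_iters):
        # Neighborhood of v: its current neighbors plus the points that list v.
        around = [set(g) for g in graph]
        for i, g in enumerate(graph):
            for j in g:
                around[j].add(i)
        updates = 0
        for members in around:  # local join: compare members pairwise
            ordered = sorted(members)
            for x, a in enumerate(ordered):
                for b in ordered[x + 1:]:
                    d = sq_dist(points[a], points[b])
                    updates += try_insert(graph[a], b, d, k)
                    updates += try_insert(graph[b], a, d, k)
        if updates == 0:
            break
    return [sorted(g, key=g.get) for g in graph]


points = [(float(i), float(i % 3)) for i in range(12)]
knn = nn_descent(points, k=3)
print(knn[0])
```

The pairwise distance computations in the local join are where most of the time goes, and their memory accesses are irregular; these are the memory access and computation problems that the abstract says the accelerator targets.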

Citations: 0

DONGLE 2.0: Direct FPGA-Orchestrated NVMe Storage for HLS

Pub Date : 2024-03-05 DOI: 10.1145/3650038

Linus Y. Wong, Jialiang Zhang, Jing (Jane) Li

Rapid growth in data size poses significant computational and memory challenges to data processing. FPGA accelerators and near-storage processing have emerged as compelling solutions for tackling the growing computational and memory requirements. Many FPGA-based accelerators have been shown to be effective in processing large data sets by leveraging the storage capability of either host-attached or FPGA-attached storage devices. However, the current HLS development environment does not allow direct access to host- or FPGA-attached NVMe storage from the HLS code. As such, users must frequently hand off between HLS and host code to access data in storage, and such a process requires tedious programming to ensure functional correctness. Moreover, since HLS code uses radically different methods to access storage than to access DRAM, an HLS codebase targeting DRAM-based platforms cannot be easily ported to NVMe-based platforms, resulting in limited code portability and reusability. Furthermore, frequent suspension of the HLS kernel and synchronization between CPU and FPGA introduce significant latency overhead and require sophisticated scheduling mechanisms to hide latency.

To address these challenges, we propose a new HLS storage interface named DONGLE 2.0 that enables direct FPGA-orchestrated NVMe storage access. By providing a unified interface for storage and memory access, DONGLE 2.0 allows a single-source HLS program to target multiple memory/storage devices, thus making the codebase cleaner, more portable, and more efficient. DONGLE 2.0 is an extension of DONGLE 1.0 [1] that adds support for host-attached storage. While its primary focus is still on FPGA NVMe access in near-storage configurations, the added host storage support ensures its compatibility with platforms that lack native support for FPGA-attached NVMe storage. We implemented a prototype of DONGLE 2.0 using an AMD/Xilinx Alveo U200 FPGA and Solidigm DC-P4610 SSD. Our evaluation on various workloads showed a geometric mean speed-up of 2.3× and a reduction in lines of code by 2.4× compared to the state-of-the-art commercial platform when using FPGA-attached NVMe storage. Moreover, DONGLE 2.0 demonstrated a geometric mean speed-up of 1.5× and a reduction in lines of code by 2.4× compared to the state-of-the-art commercial platform when using host-attached NVMe storage.
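Since the results are reported as geometric means, a short illustration of that metric may help. The per-workload speedups below are invented for the example and are not taken from the paper; only `statistics.geometric_mean` from the Python standard library is used.

```python
from statistics import geometric_mean, mean

# Hypothetical per-workload speedups (not from the paper).
speedups = [1.0, 2.0, 4.0]

gm = geometric_mean(speedups)  # cube root of 1.0 * 2.0 * 4.0
am = mean(speedups)            # arithmetic mean, shown for comparison
print(gm, am)
```

The geometric mean is the usual choice for averaging ratios such as speedups, because a single outlier workload pulls it up less than it would pull up the arithmetic mean.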

Citations: 0

ScalaBFS2: A High Performance BFS Accelerator on an HBM-enhanced FPGA Chip

IF 2.3 Zone 4 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2024-02-29 DOI: 10.1145/3650037

Kexin Li, Shaoxian Xu, Zhiyuan Shao, Ran Zheng, Xiaofei Liao, Hai Jin

The introduction of High Bandwidth Memory (HBM) to FPGA chips makes it possible for an FPGA-based accelerator to leverage the huge memory bandwidth of HBM to improve its performance. This is especially true for the Breadth-First Search (BFS) algorithm, which demands high bandwidth when accessing the graph data stored in memory. Unlike traditional FPGA-DRAM platforms, where memory bandwidth is the scarce resource because of the limited number of DRAM channels, FPGA chips equipped with HBM have much higher memory bandwidth, provided by a large number of HBM channels, but still a limited amount of logic (LUT, FF, and BRAM/URAM) resources. Therefore, the key to designing a high-performance BFS accelerator on an HBM-enhanced FPGA chip is to use the logic resources efficiently to build as many Processing Elements (PEs) as possible, and to configure them flexibly to obtain from the HBM as much effective memory bandwidth (bandwidth that is useful to the algorithm) as possible, rather than one-sidedly emphasizing the absolute memory bandwidth. To exploit as much effective bandwidth from the HBM as possible, ScalaBFS2 conducts BFS on graphs in a vertex-centric manner and proposes designs that use the FPGA logic resources efficiently, including an independent module (HBM Reader) for memory access, a multi-layer crossbar, and PEs that implement hybrid-mode algorithm processing (i.e., capable of working in both push and pull modes). Consequently, ScalaBFS2 is able to build up to 128 PEs on the XCU280 FPGA chip (produced with a 16nm process and configured with two HBM2 stacks) of a Xilinx Alveo U280 board, and achieves a performance of 56.92 GTEPS (Giga Traversed Edges Per Second) by fully using its 32 HBM memory channels. Compared with the state-of-the-art graph processing system (i.e., ReGraph) built on the same board, ScalaBFS2 achieves 2.52x ∼ 4.40x performance speedups. Moreover, compared with Gunrock running on an Nvidia A100 GPU, which is produced with a 7nm process and configured with five HBM2e stacks, ScalaBFS2 achieves 1.34x ∼ 2.40x speedups in absolute performance and 7.35x ∼ 13.18x speedups in power efficiency.
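As a software-level illustration of the hybrid (push/pull) vertex-centric BFS mentioned in the abstract, the following is a minimal Python sketch. It is not the ScalaBFS2 hardware design: the function name `hybrid_bfs`, the adjacency-list inputs, and the frontier-size threshold `alpha` used to switch modes are all assumptions made for illustration only.

```python
from typing import List


def hybrid_bfs(adj: List[List[int]], radj: List[List[int]],
               source: int, alpha: float = 0.25) -> List[int]:
    """Return the BFS level of every vertex (-1 if unreachable).

    adj[u]  lists the out-neighbours of u (used in push mode).
    radj[v] lists the in-neighbours of v (used in pull mode).
    alpha is a hypothetical switching threshold: when the frontier holds
    at least alpha * n vertices, the level is processed in pull mode.
    """
    n = len(adj)
    level = [-1] * n
    level[source] = 0
    frontier = {source}
    depth = 0
    while frontier:
        depth += 1
        next_frontier = set()
        if len(frontier) < alpha * n:
            # Push mode: each frontier vertex updates its unvisited out-neighbours.
            for u in frontier:
                for v in adj[u]:
                    if level[v] == -1:
                        level[v] = depth
                        next_frontier.add(v)
        else:
            # Pull mode: each unvisited vertex looks for a parent in the frontier.
            for v in range(n):
                if level[v] == -1:
                    for u in radj[v]:
                        if u in frontier:
                            level[v] = depth
                            next_frontier.add(v)
                            break  # one parent is enough
        frontier = next_frontier
    return level


if __name__ == "__main__":
    # Small directed example graph: 0->1, 0->2, 1->3, 2->3, 3->4; vertex 5 is isolated.
    adj = [[1, 2], [3], [3], [4], [], []]
    radj = [[], [0], [0], [1, 2], [3], []]
    print(hybrid_bfs(adj, radj, source=0))
```

Both modes compute the same levels; they differ only in which side of an edge does the work, which is why a design can switch between them per level depending on frontier size.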



Citations: 0

AxOMaP: Designing FPGA-based Approximate Arithmetic Operators using Mathematical Programming

IF 2.3 Zone 4 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2024-02-19 DOI: 10.1145/3648694

Siva Satyendra Sahoo, Salim Ullah, Akash Kumar

With the increasing application of machine learning (ML) algorithms in embedded systems, there is a rising necessity to design low-cost computer arithmetic for these resource-constrained systems. As a result, emerging models of computation, such as approximate and stochastic computing, that leverage the inherent error-resilience of such algorithms are being actively explored for implementing ML inference on resource-constrained systems. Approximate computing (AxC) aims to provide disproportionate gains in the power, performance, and area (PPA) of an application by allowing some level of reduction in its behavioral accuracy (BEHAV). Using approximate operators (AxOs) for computer arithmetic forms one of the more prevalent methods of implementing AxC. AxOs provide additional scope for finer granularity of optimization, compared to only precision scaling of computer arithmetic. To this end, the design of platform-specific and cost-efficient approximate operators forms an important research goal. Recently, multiple works have reported the use of AI/ML-based approaches for synthesizing novel FPGA-based AxOs. However, most such works limit the use of AI/ML to designing ML-based surrogate functions that are used during iterative optimization processes. To this end, we propose a novel data analysis-driven mathematical programming-based approach to synthesizing approximate operators for FPGAs. Specifically, we formulate mixed integer quadratically constrained programs based on the results of correlation analysis of the characterization data and use the solutions to enable a more directed search approach for evolutionary optimization algorithms. Compared to traditional evolutionary algorithms-based optimization, we report up to 21% improvement in the hypervolume, for joint optimization of PPA and BEHAV, in the design of signed 8-bit multipliers. Further, we report up to 27% better hypervolume than other state-of-the-art approaches to DSE for FPGA-based application-specific AxOs.
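The abstract reports its gains in terms of hypervolume, a standard quality indicator for multi-objective optimization results. The sketch below shows how the indicator can be computed in the simplest case of two objectives that are both minimized (for example, one cost metric and one error metric). The function name `hypervolume_2d`, the sample points, and the reference point are hypothetical and are not taken from the paper.

```python
from typing import Iterable, Tuple


def hypervolume_2d(points: Iterable[Tuple[float, float]],
                   ref: Tuple[float, float]) -> float:
    """Area dominated by `points` and bounded by `ref` (both objectives minimized)."""
    # Only points that are strictly better than the reference in both
    # objectives contribute any area.
    pts = sorted((x, y) for x, y in points if x < ref[0] and y < ref[1])
    hv = 0.0
    best_y = ref[1]  # lowest second-objective value seen so far
    for x, y in pts:
        if y < best_y:
            # Add the horizontal strip that this point newly dominates.
            hv += (ref[0] - x) * (best_y - y)
            best_y = y
    return hv


if __name__ == "__main__":
    # Hypothetical (cost, error) pairs; (3, 3) is dominated and adds nothing.
    front = [(1, 3), (2, 2), (3, 1), (3, 3)]
    print(hypervolume_2d(front, ref=(4, 4)))
```

A larger hypervolume means the set of designs covers more of the objective space relative to the chosen reference point, so percentage improvements in hypervolume depend on that reference point being held fixed across the methods compared.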



Citations: 0

Introduction to the FPL 2021 Special Section

IF 2.3 Zone 4 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2024-02-12 DOI: 10.1145/3635115

Diana Göhringer, Georgios Keramidas, Akash Kumar

Citations: 0