
Latest Articles from IEEE Transactions on Computers

Compressed Test Pattern Generation for Deep Neural Networks
IF 3.6 | CAS Tier 2 (Computer Science) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-09-11 | DOI: 10.1109/TC.2024.3457738
Dina A. Moussa;Michael Hefenbrock;Mehdi Tahoori
Deep neural networks (DNNs) have emerged as an effective approach for many artificial intelligence tasks. Specialized accelerators are often used to enhance DNN performance and lower energy costs. However, the presence of faults can drastically impair the performance and accuracy of these accelerators. Usually, many test patterns are required for certain types of faults to reach a target fault coverage, which in turn increases the testing overhead and storage cost, particularly for in-field testing. For this reason, compression is typically applied after the test generation step to reduce the storage cost of the generated test patterns. However, compression is more effective when considered at an earlier stage. This paper generates test patterns directly in a compressed form so that they require less storage. This is done by generating all test patterns as linear combinations of a set of jointly used test patterns (a basis), for which only the coefficients need to be stored. The fault coverage achieved by the generated test patterns is compared to that of adversarial and randomly generated test images. The experimental results show that our proposed test patterns outperform both baselines, achieving high fault coverage (up to 99.99%) and a high compression ratio (up to 307.2×).
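As a rough illustration of the basis-plus-coefficients idea described in this abstract, the sketch below (plain NumPy, with made-up sizes; not the authors' implementation) reconstructs a full test set from a small stored basis and per-pattern coefficient vectors and reports the resulting storage ratio:

```python
# A minimal sketch of storing only a shared basis plus per-pattern coefficients,
# and reconstructing the full test patterns on demand. Sizes are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

n_basis, pattern_dim, n_patterns = 8, 32 * 32, 100    # hypothetical sizes
basis = rng.standard_normal((n_basis, pattern_dim))   # shared basis patterns (stored)
coeffs = rng.standard_normal((n_patterns, n_basis))   # one coefficient vector per test pattern (stored)

# Reconstruct the full test set as linear combinations instead of storing it.
test_patterns = coeffs @ basis                        # shape: (n_patterns, pattern_dim)

stored = basis.size + coeffs.size
full = test_patterns.size
print(f"storage ratio ~ {full / stored:.1f}x for this toy configuration")
```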
{"title":"Compressed Test Pattern Generation for Deep Neural Networks","authors":"Dina A. Moussa;Michael Hefenbrock;Mehdi Tahoori","doi":"10.1109/TC.2024.3457738","DOIUrl":"10.1109/TC.2024.3457738","url":null,"abstract":"Deep neural networks (DNNs) have emerged as an effective approach in many artificial intelligence tasks. Several specialized accelerators are often used to enhance DNN's performance and lower their energy costs. However, the presence of faults can drastically impair the performance and accuracy of these accelerators. Usually, many test patterns are required for certain types of faults to reach a target fault coverage, which in turn hence increases the testing overhead and storage cost, particularly for in-field testing. For this reason, compression is typically done after test generation step to reduce the storage cost for the generated test patterns. However, compression is more efficient when considered in an earlier stage. This paper generates the test pattern in a compressed form to require less storage. This is done by generating all test patterns as a linear combination of a set of jointly used test patterns (basis), for which only the coefficients need to be stored. The fault coverage achieved by the generated test patterns is compared to that of the adversarial and randomly generated test images. The experimental results showed that our proposed test pattern outperformed and achieved high fault coverage (up to 99.99%) and a high compression ratio (up to 307.2\u0000<inline-formula><tex-math>$times$</tex-math></inline-formula>\u0000).","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"307-315"},"PeriodicalIF":3.6,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
CUSPX: Efficient GPU Implementations of Post-Quantum Signature SPHINCS+
IF 3.6 | CAS Tier 2 (Computer Science) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-09-11 | DOI: 10.1109/TC.2024.3457736
Ziheng Wang;Xiaoshe Dong;Heng Chen;Yan Kang;Qiang Wang
Quantum computers pose a serious threat to existing cryptographic systems. While Post-Quantum Cryptography (PQC) offers resilience against quantum attacks, its performance limitations often hinder widespread adoption. Among the three National Institute of Standards and Technology (NIST)-selected general-purpose PQC schemes, SPHINCS+ is particularly susceptible to these limitations. We introduce CUSPX (CUDA SPHINCS+), the first large-scale parallel implementation of SPHINCS+ capable of running across 10,000 cores. CUSPX leverages a novel three-level parallelism framework, applying it to algorithmic parallelism, data parallelism, and hybrid parallelism. Notably, CUSPX introduces parallel Merkle tree construction algorithms for arbitrary parallel scales and several load-balancing solutions, further enhancing performance. By treating task parallelism as the top level of parallelism, CUSPX provides a four-level parallel scheme that can run with any number of tasks. Evaluated on a single GeForce RTX 3090 using the SPHINCS+-SHA-256-128s-simple parameter set, CUSPX achieves a single task's signature generation latency of 0.67 ms, demonstrating a 5,105× speedup over a single-thread version and an 18.50× speedup over the previous fastest implementation.
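The following sketch is a hedged, CPU-only illustration of the per-level parallelism a Merkle tree construction admits (all nodes of one level can be hashed independently); it uses Python threads and SHA-256 and is not CUSPX's CUDA implementation. The power-of-two leaf count is an assumption made for brevity:

```python
# A minimal sketch: build one Merkle tree level at a time, dispatching all node
# hashes of a level to a worker pool, since nodes within a level are independent.
import hashlib
from concurrent.futures import ThreadPoolExecutor

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes], workers: int = 8) -> bytes:
    # Assumes len(leaves) is a power of two, for brevity.
    level = [h(leaf) for leaf in leaves]              # leaf hashes
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while len(level) > 1:
            pairs = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
            level = list(pool.map(h, pairs))          # hash every parent of this level
    return level[0]

if __name__ == "__main__":
    leaves = [bytes([i]) * 32 for i in range(16)]     # 16 toy leaves
    print(merkle_root(leaves).hex())
```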
{"title":"CUSPX: Efficient GPU Implementations of Post-Quantum Signature SPHINCS+","authors":"Ziheng Wang;Xiaoshe Dong;Heng Chen;Yan Kang;Qiang Wang","doi":"10.1109/TC.2024.3457736","DOIUrl":"10.1109/TC.2024.3457736","url":null,"abstract":"Quantum computers pose a serious threat to existing cryptographic systems. While Post-Quantum Cryptography (PQC) offers resilience against quantum attacks, its performance limitations often hinder widespread adoption. Among the three National Institute of Standards and Technology (NIST)-selected general-purpose PQC schemes, SPHINCS\u0000<inline-formula><tex-math>${}^{+}$</tex-math></inline-formula>\u0000 is particularly susceptible to these limitations. We introduce CUSPX (\u0000<u>CU</u>\u0000DA \u0000<u>SP</u>\u0000HIN\u0000<u>CS</u>\u0000 \u0000<inline-formula><tex-math>${}^{+}$</tex-math></inline-formula>\u0000), the first large-scale parallel implementation of SPHINCS\u0000<inline-formula><tex-math>${}^{+}$</tex-math></inline-formula>\u0000 capable of running across 10,000 cores. CUSPX leverages a novel three-level parallelism framework, applying it to \u0000<i>algorithmic parallelism</i>\u0000, \u0000<i>data parallelism</i>\u0000, and \u0000<i>hybrid parallelism</i>\u0000. Notably, CUSPX introduces parallel Merkle tree construction algorithms for arbitrary parallel scales and several load-balancing solutions, further enhancing performance. By treating tasks parallelism as the top level of parallelism, CUSPX provides a four-level parallel scheme that can run with any number of tasks. Evaluated on a single GeForce RTX 3090 using the SPHINCS\u0000<inline-formula><tex-math>${}^{+}$</tex-math></inline-formula>\u0000-SHA-256-128s-simple parameter set, CUSPX achieves a single task's signature generation latency of 0.67 ms, demonstrating a 5,105\u0000<inline-formula><tex-math>$times$</tex-math></inline-formula>\u0000 speedup over a single-thread version and an 18.50\u0000<inline-formula><tex-math>$times$</tex-math></inline-formula>\u0000 speedup over the previous fastest implementation.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"15-28"},"PeriodicalIF":3.6,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Component Dependencies Based Network-on-Chip Test
IF 3.6 | CAS Tier 2 (Computer Science) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-09-11 | DOI: 10.1109/TC.2024.3457732
Letian Huang;Tianjin Zhao;Ziren Wang;Junkai Zhan;Junshi Wang;Xiaohang Wang
On-line testing of NoCs is essential for their reliability. This paper proposes an integrated test solution for on-line NoC testing that reduces test cost and improves NoC reliability. The test solution includes a new partitioning method, as well as a test method and a test schedule based on the proposed partitioning. The partitioning method divides the NoC into a new type of basic unit under test (UUT), named the interdependent-components-based unit under test (iDC-UUT), to which component-level test methods are applied. iDC-UUTs have a very low level of functional interdependency and simple physical connections, which results in small test overhead and high test coverage. The proposed test method consists of a DFT architecture, a test wrapper, and test vectors, which speed up the test procedure and further improve test coverage. The proposed test schedule reduces the blockage probability of data packets during testing by increasing the degree of test disorder, further reducing the test cost. Experimental results show that the proposed test solution reduces power and area by 12.7% and 22.7% over an existing test solution, and average latency is reduced by 22.6% to 38.4% over the existing solution.
{"title":"Component Dependencies Based Network-on-Chip Test","authors":"Letian Huang;Tianjin Zhao;Ziren Wang;Junkai Zhan;Junshi Wang;Xiaohang Wang","doi":"10.1109/TC.2024.3457732","DOIUrl":"10.1109/TC.2024.3457732","url":null,"abstract":"On-line test of NoC is essential for its reliability. This paper proposed an integral test solution for on-line test of NoC to reduce the test cost and improve the reliability of NOC. The test solution includes a new partitioning method, as well as a test method and a test schedule which are based on the proposed partitioning method. The new partitioning method partitions the NoC into a new type of basis unit under test (UUT) named as interdependent components based unit under test (iDC-UUT), which applies component test methods. The iDC-UUT have very low level of functional interdependency and simple physical connection, which results in small test overhead and high test coverage. The proposed test method consists of DFT architecture, test wrapper and test vectors, which can speed-up the test procedure and further improve the test coverage. The proposed test schedule reduces the blockage probability of data packets during testing by increasing the degree of test disorder, so as to further reduce the test cost. Experimental results show that the proposed test solution reduces power and area by 12.7% and 22.7% over an existing test solution. The average latency is reduced by 22.6% to 38.4% over the existing test solution.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 12","pages":"2805-2816"},"PeriodicalIF":3.6,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
FLALM: A Flexible Low Area-Latency Montgomery Modular Multiplication on FPGA
IF 3.6 | CAS Tier 2 (Computer Science) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-09-11 | DOI: 10.1109/TC.2024.3457739
Yujun Xie;Yuan Liu;Xin Zheng;Bohan Lan;Dengyun Lei;Dehao Xiang;Shuting Cai;Xiaoming Xiong
Montgomery Modular Multiplication (MMM) is widely used in many public-key cryptography systems. This paper presents a Flexible Low Area-Latency MMM (FLALM) implementation, which supports Generic Montgomery Modular Multiplication (GMM) and Square Montgomery Modular Multiplication (SMM) operations. A new SMM schedule for the Finely Integrated Product Scanning (FIPS) GMM algorithm is proposed to accelerate SMM with minimal additional design. Furthermore, a new FIPS dual-schedule is proposed to resolve the data hazards of this algorithm. Finally, we explore the trade-off between area and latency and present the FLALM to accelerate GMM and SMM. The FLALM is implemented on an FPGA (Virtex-7 platform). The results show that the area×latency (AL) value of FLALM (word size w = 128) is 38.1% and 44.7% better than the previous state-of-the-art scalable references when performing 1024-bit and 2048-bit GMM, respectively. Moreover, when computing SMM, the AL advantage rises to 73.7% and 86.3%, respectively.
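For readers unfamiliar with MMM, the sketch below shows textbook Montgomery multiplication (REDC) in plain Python; it illustrates only the arithmetic that FLALM schedules in hardware, not the FIPS word-level scheduling or the FPGA datapath. The toy modulus is arbitrary:

```python
# A minimal, word-free sketch of Montgomery multiplication: computes a*b*R^{-1} mod n
# for R = 2^k, using the standard REDC reduction.
def montgomery_setup(n: int):
    k = n.bit_length()
    R = 1 << k                         # R > n, gcd(R, n) = 1 since n is odd
    n_prime = -pow(n, -1, R) % R       # n * n_prime ≡ -1 (mod R)
    return R, k, n_prime

def mont_mul(a_bar: int, b_bar: int, n: int, R: int, k: int, n_prime: int) -> int:
    t = a_bar * b_bar
    m = (t * n_prime) % R
    u = (t + m * n) >> k               # exact division by R becomes a shift
    return u - n if u >= n else u

if __name__ == "__main__":
    n = 0xFFFFFFFB                     # toy odd modulus
    R, k, n_prime = montgomery_setup(n)
    a, b = 123456789, 987654321
    a_bar, b_bar = (a * R) % n, (b * R) % n          # map into the Montgomery domain
    c_bar = mont_mul(a_bar, b_bar, n, R, k, n_prime)
    c = mont_mul(c_bar, 1, n, R, k, n_prime)         # map back to the normal domain
    assert c == (a * b) % n
    print(hex(c))
```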
{"title":"FLALM: A Flexible Low Area-Latency Montgomery Modular Multiplication on FPGA","authors":"Yujun Xie;Yuan Liu;Xin Zheng;Bohan Lan;Dengyun Lei;Dehao Xiang;Shuting Cai;Xiaoming Xiong","doi":"10.1109/TC.2024.3457739","DOIUrl":"10.1109/TC.2024.3457739","url":null,"abstract":"Montgomery Modular Multiplication (MMM) is widely used in many public key cryptography systems. This paper presents a Flexible Low Area-Latency MMM (FLALM) implementation, which supports Generic Montgomery Modular Multiplication (GMM) and Square Montgomery Modular Multiplication (SMM) operations. A new SMM schedule for the Finely Integrated Product Scanning (FIPS) GMM algorithm is proposed to accelerate SMM with tiny additional design. Furthermore, a new FIPS dual-schedule is proposed to solve the data hazards of this algorithm. Finally, we explore the trade-off between area and latency, and present the FLALM to accelerate GMM and SMM. The FLALM is implemented on FPGA (Virtex-7 platform). The results show that the area*latency (AL) value of FLALM (wordsize \u0000<inline-formula><tex-math>$w$</tex-math></inline-formula>\u0000=128) is 38.1% and 44.7% better than the previous state-of-art scalable references when performing 1024-bit and 2048-bit GMM, respectively. Moreover, when computing SMM, the advantage of AL value is raised to 73.7% and 86.3% respectively.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"29-42"},"PeriodicalIF":3.6,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Exploiting Structured Feature and Runtime Isolation for High-Performant Recommendation Serving
IF 3.6 | CAS Tier 2 (Computer Science) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-08-28 | DOI: 10.1109/TC.2024.3449749
Xin You;Hailong Yang;Siqi Wang;Tao Peng;Chen Ding;Xinyuan Li;Bangduo Chen;Zhongzhi Luan;Tongxuan Liu;Yong Li;Depei Qian
Recommendation serving with deep learning models is one of the most valuable services of modern e-commerce companies. In production, to accommodate billions of recommendation queries with stringent service-level agreements, high-performant recommendation serving systems play an essential role in meeting such daunting demand. Unfortunately, existing model serving frameworks fail to achieve efficient serving due to unique challenges such as 1) the input format mismatch between service needs and the model's ability and 2) heavy software contention when concurrently executing the constrained operations. To address the above challenges, we propose RecServe, a high-performant serving system for recommendation with an optimized design of structured features and SessionGroups. With structured features, RecServe packs single-user-multiple-candidates inputs by semi-automatically transforming computation graphs with annotated input tensors, which can significantly reduce redundant network transmission, data movement, and useless computation. With session groups, RecServe further adopts resource isolation for multiple compute streams and a cost-aware operator scheduler with a critical-path-based schedule policy to enable concurrent kernel execution, further improving serving throughput. The experimental results demonstrate that RecServe can achieve maximum performance speedups of 12.3× and 22.0× compared to the state-of-the-art serving system on CPU and GPU platforms, respectively.
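A minimal sketch of the single-user-multiple-candidates packing idea, with hypothetical feature shapes and plain NumPy (not RecServe's graph transformation): the shared user features are kept as one row instead of being replicated per candidate, which is where the transfer savings come from:

```python
# Toy comparison of a "naive" request layout (user row replicated per candidate)
# against a "packed" layout (user row sent once). Shapes are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n_candidates, user_dim, item_dim = 512, 64, 48

user_feat = rng.standard_normal(user_dim)                  # shared across all candidates
cand_feat = rng.standard_normal((n_candidates, item_dim))  # one row per candidate item

# Naive layout: the user row is duplicated n_candidates times before scoring.
naive_input = np.hstack([np.tile(user_feat, (n_candidates, 1)), cand_feat])

# Packed layout: keep the user row once; expand only at the model boundary.
packed_bytes = user_feat.nbytes + cand_feat.nbytes
naive_bytes = naive_input.nbytes
print(f"transfer reduced by ~{naive_bytes / packed_bytes:.1f}x for this toy shape")
```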
{"title":"Exploiting Structured Feature and Runtime Isolation for High-Performant Recommendation Serving","authors":"Xin You;Hailong Yang;Siqi Wang;Tao Peng;Chen Ding;Xinyuan Li;Bangduo Chen;Zhongzhi Luan;Tongxuan Liu;Yong Li;Depei Qian","doi":"10.1109/TC.2024.3449749","DOIUrl":"10.1109/TC.2024.3449749","url":null,"abstract":"Recommendation serving with deep learning models is one of the most valuable services of modern E-commerce companies. In production, to accommodate billions of recommendation queries with stringent service level agreements, high-performant recommendation serving systems play an essential role in meeting such daunting demand. Unfortunately, existing model serving frameworks fail to achieve efficient serving due to unique challenges such as 1) the input format mismatch between service needs and the model's ability and 2) heavy software contentions to concurrently execute the constrained operations. To address the above challenges, we propose \u0000<i>RecServe</i>\u0000, a high-performant serving system for recommendation with the optimized design of \u0000<i>structured features</i>\u0000 and \u0000<i>SessionGroups</i>\u0000 for recommendation serving. With \u0000<i>structured features</i>\u0000, \u0000<i>RecServe</i>\u0000 packs single-user-multiple-candidates inputs by semi-automatically transforming computation graphs with annotated input tensors, which can significantly reduce redundant network transmission, data movements, and useless computations. With \u0000<i>session group</i>\u0000, \u0000<i>RecServe</i>\u0000 further adopts resource isolations for multiple compute streams and cost-aware operator scheduler with critical-path-based schedule policy to enable concurrent kernel execution, further improving serving throughput. The experiment results demonstrate that \u0000<i>RecServe</i>\u0000 can achieve maximum performance speedups of 12.3\u0000<inline-formula><tex-math>$boldsymbol{times}$</tex-math></inline-formula>\u0000 and \u0000<inline-formula><tex-math>$22.0boldsymbol{times}$</tex-math></inline-formula>\u0000 compared to the state-of-the-art serving system on CPU and GPU platforms, respectively.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 11","pages":"2474-2487"},"PeriodicalIF":3.6,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Falic: An FPGA-Based Multi-Scalar Multiplication Accelerator for Zero-Knowledge Proof
IF 3.6 | CAS Tier 2 (Computer Science) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-08-23 | DOI: 10.1109/TC.2024.3449121
Yongkui Yang;Zhenyan Lu;Jingwei Zeng;Xingguo Liu;Xuehai Qian;Zhibin Yu
In this paper, we propose Falic, a novel FPGA-based accelerator for multi-scalar multiplication (MSM), the most time-consuming phase of zk-SNARK proof generation. Falic introduces three techniques. First, it leverages a globally asynchronous, locally synchronous (GALS) strategy to build multiple small and lightweight MSM cores that parallelize the independent inner-product computation on different portions of the scalar vector and point vector. Second, each MSM core contains just one large-integer modular multiplier (LIMM) that is multiplexed to perform the point additions (PADDs) generated during MSM. We strike a balance between throughput and hardware cost by batching the appropriate number of PADDs and selecting the PADD computation graph with a proper degree of parallelism. Finally, performance is further improved by a simple cache structure that enables computation reuse. We implement Falic on two FPGAs with different hardware resources, the Xilinx U200 and Xilinx U250. Compared to the prior FPGA-based accelerator, Falic improves MSM throughput by 3.9×. Experimental results also show that Falic achieves a throughput speedup of up to 1.62× and saves as much as 8.5× energy compared to an RTX 2080Ti GPU.
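As context for what the accelerator computes, the sketch below shows the windowed bucket method for MSM using integer addition modulo q as a stand-in for elliptic-curve point addition; it captures the MSM structure (one "PADD" per point per window plus a bucket reduction) but none of Falic's FPGA design:

```python
# A toy windowed-bucket MSM: compute sum(s_i * P_i) where the "group" is the
# integers under addition mod q, standing in for elliptic-curve points.
def msm_buckets(scalars, points, q, c=4):
    n_windows = (max(scalars).bit_length() + c - 1) // c
    acc = 0
    for w in reversed(range(n_windows)):
        for _ in range(c):                         # "double" the accumulator c times
            acc = (acc + acc) % q
        buckets = [0] * (1 << c)
        for s, p in zip(scalars, points):
            digit = (s >> (w * c)) & ((1 << c) - 1)
            if digit:
                buckets[digit] = (buckets[digit] + p) % q   # one "PADD" per point
        running, window_sum = 0, 0
        for b in reversed(buckets[1:]):            # running-sum trick: sum_j j * buckets[j]
            running = (running + b) % q
            window_sum = (window_sum + running) % q
        acc = (acc + window_sum) % q
    return acc

if __name__ == "__main__":
    q = 2**61 - 1
    scalars, points = [3, 10, 255, 77], [5, 7, 11, 13]
    assert msm_buckets(scalars, points, q) == sum(s * p for s, p in zip(scalars, points)) % q
    print("bucket MSM matches the naive sum")
```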
{"title":"Falic: An FPGA-Based Multi-Scalar Multiplication Accelerator for Zero-Knowledge Proof","authors":"Yongkui Yang;Zhenyan Lu;Jingwei Zeng;Xingguo Liu;Xuehai Qian;Zhibin Yu","doi":"10.1109/TC.2024.3449121","DOIUrl":"10.1109/TC.2024.3449121","url":null,"abstract":"In this paper, we propose Falic, a novel FPGA-based accelerator to accelerate multi-scalar multiplication (MSM), the most time-consuming phase of zk-SNARK proof generation. Falic innovates three techniques. First, it leverages globally asynchronous locally synchronous (GALS) strategy to build multiple small and lightweight MSM cores to parallelize the independent inner product computation on different portions of the scalar vector and point vector. Second, each MSM core contains just one large-integer modular multiplier (LIMM) that is multiplexed to perform the point additions (PADDs) generated during MSM. We strike a balance between the throughput and hardware cost by batching the appropriate number of PADDs and selecting the computation graph of PADD with proper parallelism degree. Finally, the performance is further improved by a simple cache structure that enables the computation reuse. We implement Falic on two different FPGAs with different hardware resources, i.e., the Xilinx U200 and Xilinx U250. Compared to the prior FPGA-based accelerator, Falic improves the MSM throughput by \u0000<inline-formula><tex-math>$3.9boldsymbol{times}$</tex-math></inline-formula>\u0000. Experimental results also show that Falic achieves a throughput speedup of up to \u0000<inline-formula><tex-math>$1.62boldsymbol{times}$</tex-math></inline-formula>\u0000 and saves as much as \u0000<inline-formula><tex-math>$8.5boldsymbol{times}$</tex-math></inline-formula>\u0000 energy compared to an RTX 2080Ti GPU.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 12","pages":"2791-2804"},"PeriodicalIF":3.6,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
HGNAS: Hardware-Aware Graph Neural Architecture Search for Edge Devices
IF 3.6 | CAS Tier 2 (Computer Science) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-08-23 | DOI: 10.1109/TC.2024.3449108
Ao Zhou;Jianlei Yang;Yingjie Qi;Tong Qiao;Yumeng Shi;Cenlin Duan;Weisheng Zhao;Chunming Hu
Graph Neural Networks (GNNs) are becoming increasingly popular for graph-based learning tasks such as point cloud processing due to their state-of-the-art (SOTA) performance. Nevertheless, the research community has primarily focused on improving model expressiveness, with little consideration of how to design efficient GNN models for edge scenarios with real-time requirements and limited resources. Examining existing GNN models reveals varied execution behavior across platforms and frequent Out-Of-Memory (OOM) problems, highlighting the need for hardware-aware GNN design. To address this challenge, this work proposes HGNAS, a novel hardware-aware graph neural architecture search framework tailored for resource-constrained edge devices. To achieve hardware awareness, HGNAS integrates an efficient GNN hardware performance predictor that evaluates the latency and peak memory usage of GNNs in milliseconds. Meanwhile, we study GNN memory usage during inference and offer a peak memory estimation method, enhancing the robustness of architecture evaluations when combined with predictor outcomes. Furthermore, HGNAS constructs a fine-grained design space to enable the exploration of extreme-performance architectures by decoupling the GNN paradigm. In addition, a multi-stage hierarchical search strategy is leveraged to facilitate navigation of the huge candidate space, reducing a single search to a few GPU hours. To the best of our knowledge, HGNAS is the first automated GNN design framework for edge devices, and also the first work to achieve hardware awareness of GNNs across different platforms. Extensive experiments across various applications and edge devices have proven the superiority of HGNAS. It achieves up to a 10.6× speedup and an 82.5% peak memory reduction with negligible accuracy loss compared to DGCNN on ModelNet40.
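A hedged sketch of the predictor-guided, hardware-aware selection step: random candidate architectures are filtered by stand-in latency and peak-memory predictors against a device budget. The predictor formulas and the search space below are invented placeholders; HGNAS learns its predictor and uses a hierarchical search rather than this naive loop:

```python
# Toy hardware-aware search: keep the best-scoring candidate that fits the budget.
import random

def predict_latency_ms(arch):            # hypothetical stand-in predictor
    return 0.5 * arch["layers"] + 0.01 * arch["width"]

def predict_peak_mem_kb(arch):           # hypothetical stand-in predictor
    return 4 * arch["layers"] * arch["width"]

def estimate_accuracy(arch):             # hypothetical proxy score, bigger nets score higher
    return 1 - 1 / (arch["layers"] * arch["width"])

def search(n_trials=200, lat_budget_ms=5.0, mem_budget_kb=2048, seed=0):
    rng = random.Random(seed)
    best, best_acc = None, -1.0
    for _ in range(n_trials):
        arch = {"layers": rng.randint(2, 12), "width": rng.choice([16, 32, 64, 128])}
        if predict_latency_ms(arch) > lat_budget_ms:
            continue                     # reject: predicted latency over budget
        if predict_peak_mem_kb(arch) > mem_budget_kb:
            continue                     # reject: predicted peak memory over budget
        acc = estimate_accuracy(arch)
        if acc > best_acc:
            best, best_acc = arch, acc
    return best

print(search())
```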
{"title":"HGNAS: Hardware-Aware Graph Neural Architecture Search for Edge Devices","authors":"Ao Zhou;Jianlei Yang;Yingjie Qi;Tong Qiao;Yumeng Shi;Cenlin Duan;Weisheng Zhao;Chunming Hu","doi":"10.1109/TC.2024.3449108","DOIUrl":"10.1109/TC.2024.3449108","url":null,"abstract":"Graph Neural Networks (GNNs) are becoming increasingly popular for graph-based learning tasks such as point cloud processing due to their state-of-the-art (SOTA) performance. Nevertheless, the research community has primarily focused on improving model expressiveness, lacking consideration of how to design efficient GNN models for edge scenarios with real-time requirements and limited resources. Examining existing GNN models reveals varied execution across platforms and frequent Out-Of-Memory (OOM) problems, highlighting the need for hardware-aware GNN design. To address this challenge, this work proposes a novel hardware-aware graph neural architecture search framework tailored for resource constraint edge devices, namely HGNAS. To achieve hardware awareness, HGNAS integrates an efficient GNN hardware performance predictor that evaluates the latency and peak memory usage of GNNs in milliseconds. Meanwhile, we study GNN memory usage during inference and offer a peak memory estimation method, enhancing the robustness of architecture evaluations when combined with predictor outcomes. Furthermore, HGNAS constructs a fine-grained design space to enable the exploration of extreme performance architectures by decoupling the GNN paradigm. In addition, the multi-stage hierarchical search strategy is leveraged to facilitate the navigation of huge candidates, which can reduce the single search time to a few GPU hours. To the best of our knowledge, HGNAS is the first automated GNN design framework for edge devices, and also the first work to achieve hardware awareness of GNNs across different platforms. Extensive experiments across various applications and edge devices have proven the superiority of HGNAS. It can achieve up to a \u0000<inline-formula><tex-math>$10.6boldsymbol{times}$</tex-math></inline-formula>\u0000 speedup and an \u0000<inline-formula><tex-math>$82.5%$</tex-math></inline-formula>\u0000 peak memory reduction with negligible accuracy loss compared to DGCNN on ModelNet40.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 12","pages":"2693-2707"},"PeriodicalIF":3.6,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Enabling Efficient Deep Learning on MCU With Transient Redundancy Elimination
IF 3.6 | CAS Tier 2 (Computer Science) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-08-23 | DOI: 10.1109/TC.2024.3449102
Jiesong Liu;Feng Zhang;Jiawei Guan;Hsin-Hsuan Sung;Xiaoguang Guo;Saiqin Long;Xiaoyong Du;Xipeng Shen
Deploying deep neural networks (DNNs) with satisfactory performance in resource-constrained environments is challenging. This is especially true of microcontrollers due to their tight memory budgets and limited computational capabilities. However, there is a growing demand for DNNs on microcontrollers, as executing large DNNs on microcontrollers is critical to reducing energy consumption, increasing performance efficiency, and eliminating privacy concerns. This paper presents a novel and systematic data redundancy elimination method to implement efficient DNNs on microcontrollers through innovations in computation and space optimization. By making the optimization itself a trainable component of the target neural networks, this method maximizes performance benefits while keeping DNN accuracy stable. Experiments are performed on two microcontroller boards with three popular DNNs, namely CifarNet, ZfNet, and SqueezeNet. The experiments show that this solution eliminates more than 96% of the computations in these DNNs and makes them fit well on microcontrollers, yielding a 3.4-5× speedup with little loss of accuracy.
{"title":"Enabling Efficient Deep Learning on MCU With Transient Redundancy Elimination","authors":"Jiesong Liu;Feng Zhang;Jiawei Guan;Hsin-Hsuan Sung;Xiaoguang Guo;Saiqin Long;Xiaoyong Du;Xipeng Shen","doi":"10.1109/TC.2024.3449102","DOIUrl":"10.1109/TC.2024.3449102","url":null,"abstract":"Deploying deep neural networks (DNNs) with satisfactory performance in resource-constrained environments is challenging. This is especially true of microcontrollers due to their tight space and computational capabilities. However, there is a growing demand for DNNs on microcontrollers, as executing large DNNs on microcontrollers is critical to reducing energy consumption, increasing performance efficiency, and eliminating privacy concerns. This paper presents a novel and systematic data redundancy elimination method to implement efficient DNNs on microcontrollers through innovations in computation and space optimization. By making the optimization itself a trainable component in the target neural networks, this method maximizes performance benefits while keeping the DNN accuracy stable. Experiments are performed on two microcontroller boards with three popular DNNs, namely CifarNet, ZfNet and SqueezeNet. Experiments show that this solution eliminates more than 96% of computations in DNNs and makes them fit well on microcontrollers, yielding 3.4-5\u0000<inline-formula><tex-math>$times$</tex-math></inline-formula>\u0000 speedup with little loss of accuracy.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 12","pages":"2649-2663"},"PeriodicalIF":3.6,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200662","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
BiRD: Bi-Directional Input Reuse Dataflow for Enhancing Depthwise Convolution Performance on Systolic Arrays
IF 3.6 | CAS Tier 2 (Computer Science) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-08-23 | DOI: 10.1109/TC.2024.3449103
Mingeon Park;Seokjin Hwang;Hyungmin Cho
Depthwise convolution (DWConv) is an effective technique for reducing the size and computational requirements of convolutional neural networks. However, DWConv's input reuse pattern is not easily transformed into dense matrix multiplications, leading to low utilization of processing elements (PEs) on existing systolic arrays. In this paper, we introduce a novel systolic array dataflow mechanism called BiRD, designed to maximize input reuse and boost DWConv performance. BiRD utilizes two directions of input reuse and necessitates only minor modifications to a typical weight-stationary systolic array. We evaluate BiRD on the Gemmini platform, comparing it with existing dataflow types. The results demonstrate that BiRD achieves significant performance improvements in computation time reduction, while incurring minimal area overhead and improved energy consumption compared to other dataflow types. For example, on a 32×32 systolic array, it results in a 9.8% area overhead, significantly smaller than other dataflow types for DWConv. Compared to matrix multiplication-based DWConv, BiRD achieves a 4.7× performance improvement for DWConv layers of MobileNet-V2, resulting in a 55.8% reduction in total inference computation time and a 44.9% reduction in energy consumption. Our results highlight the effectiveness of BiRD in enhancing the performance of DWConv on systolic arrays.
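To make the mapping problem concrete, the sketch below implements depthwise convolution in plain NumPy with toy shapes: each channel is filtered by its own kernel and there is no cross-channel reduction, which is why DWConv does not flatten into the dense matrix multiplications that weight-stationary systolic arrays expect:

```python
# A minimal depthwise convolution: one k x k kernel per channel, no channel mixing.
import numpy as np

def depthwise_conv2d(x, w):
    """x: (C, H, W), w: (C, k, k), 'valid' padding, stride 1."""
    C, H, W = x.shape
    _, k, _ = w.shape
    out = np.zeros((C, H - k + 1, W - k + 1))
    for c in range(C):                        # each channel uses only its own kernel
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                out[c, i, j] = np.sum(x[c, i:i + k, j:j + k] * w[c])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))
w = rng.standard_normal((8, 3, 3))
print(depthwise_conv2d(x, w).shape)           # (8, 14, 14)
```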
{"title":"BiRD: Bi-Directional Input Reuse Dataflow for Enhancing Depthwise Convolution Performance on Systolic Arrays","authors":"Mingeon Park;Seokjin Hwang;Hyungmin Cho","doi":"10.1109/TC.2024.3449103","DOIUrl":"10.1109/TC.2024.3449103","url":null,"abstract":"Depthwise convolution (DWConv) is an effective technique for reducing the size and computational requirements of convolutional neural networks. However, DWConv's input reuse pattern is not easily transformed into dense matrix multiplications, leading to low utilization of processing elements (PEs) on existing systolic arrays. In this paper, we introduce a novel systolic array dataflow mechanism called \u0000<i>BiRD</i>\u0000, designed to maximize input reuse and boost DWConv performance. BiRD utilizes two directions of input reuse and necessitates only minor modifications to a typical weight-stationary type systolic array. We evaluate BiRD on the Gemmini platform, comparing it with existing dataflow types. The results demonstrate that BiRD achieves significant performance improvements in computation time reduction, while incurring minimal area overhead and improved energy consumption compared to other dataflow types. For example, on a 32\u0000<inline-formula><tex-math>$times{}$</tex-math></inline-formula>\u000032 systolic array, it results in a 9.8% area overhead, significantly smaller than other dataflow types for DWConv. Compared to matrix multiplication-based DWConv, BiRD achieves a 4.7\u0000<inline-formula><tex-math>$times{}$</tex-math></inline-formula>\u0000 performance improvement for DWConv layers of MobileNet-V2, resulting in a 55.8% reduction in total inference computation time and a 44.9% reduction in energy consumption. Our results highlight the effectiveness of BiRD in enhancing the performance of DWConv on systolic arrays.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 12","pages":"2708-2721"},"PeriodicalIF":3.6,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Joint Pruning and Channel-Wise Mixed-Precision Quantization for Efficient Deep Neural Networks
IF 3.6 | CAS Tier 2 (Computer Science) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-08-23 | DOI: 10.1109/TC.2024.3449084
Beatrice Alessandra Motetti;Matteo Risso;Alessio Burrello;Enrico Macii;Massimo Poncino;Daniele Jahier Pagliari
The resource requirements of deep neural networks (DNNs) pose significant challenges to their deployment on edge devices. Common approaches to addressing this issue are pruning and mixed-precision quantization, which improve latency and memory occupation. These optimization techniques are usually applied independently. We propose a novel methodology to apply them jointly via a lightweight gradient-based search, and in a hardware-aware manner, greatly reducing the time required to generate Pareto-optimal DNNs in terms of accuracy versus cost (i.e., latency or memory). We test our approach on three edge-relevant benchmarks, namely CIFAR-10, Google Speech Commands, and Tiny ImageNet. When targeting the optimization of the memory footprint, we achieve size reductions of 47.50% and 69.54% at iso-accuracy with respect to baseline networks whose weights are all quantized at 8 and 2 bits, respectively. Our method surpasses a previous state-of-the-art approach, with up to 56.17% size reduction at iso-accuracy. With respect to the sequential application of state-of-the-art pruning and mixed-precision optimizations, we obtain comparable or superior results, but with significantly lower training time. In addition, we show how well-tailored cost models can improve the cost-versus-accuracy trade-off when targeting specific hardware for deployment.
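A minimal NumPy sketch of the two optimizations applied sequentially to one weight matrix, magnitude pruning followed by symmetric per-channel quantization; the sparsity level and bit-width below are arbitrary placeholders, whereas the paper learns them jointly with a gradient-based search:

```python
# Toy demonstration: prune small-magnitude weights, then fake-quantize per output channel.
import numpy as np

def prune_by_magnitude(w, sparsity=0.5):
    thresh = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < thresh, 0.0, w)

def quantize_per_channel(w, bits):
    """Symmetric per-output-channel quantization of a (out_ch, in_ch) matrix."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                      # dequantized ("fake-quantized") weights

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128))
w_opt = quantize_per_channel(prune_by_magnitude(w, 0.5), bits=4)
print(f"sparsity: {(w_opt == 0).mean():.2f}, max abs error: {np.abs(w - w_opt).max():.3f}")
```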
{"title":"Joint Pruning and Channel-Wise Mixed-Precision Quantization for Efficient Deep Neural Networks","authors":"Beatrice Alessandra Motetti;Matteo Risso;Alessio Burrello;Enrico Macii;Massimo Poncino;Daniele Jahier Pagliari","doi":"10.1109/TC.2024.3449084","DOIUrl":"10.1109/TC.2024.3449084","url":null,"abstract":"The resource requirements of deep neural networks (DNNs) pose significant challenges to their deployment on edge devices. Common approaches to address this issue are pruning and mixed-precision quantization, which lead to latency and memory occupation improvements. These optimization techniques are usually applied independently. We propose a novel methodology to apply them jointly via a lightweight gradient-based search, and in a hardware-aware manner, greatly reducing the time required to generate Pareto-optimal DNNs in terms of accuracy versus cost (i.e., latency or memory). We test our approach on three edge-relevant benchmarks, namely CIFAR-10, Google Speech Commands, and Tiny ImageNet. When targeting the optimization of the memory footprint, we are able to achieve a size reduction of 47.50% and 69.54% at iso-accuracy with the baseline networks with all weights quantized at 8 and 2-bit, respectively. Our method surpasses a previous state-of-the-art approach with up to 56.17% size reduction at iso-accuracy. With respect to the sequential application of state-of-the-art pruning and mixed-precision optimizations, we obtain comparable or superior results, but with a significantly lowered training time. In addition, we show how well-tailored cost models can improve the cost versus accuracy trade-offs when targeting specific hardware for deployment.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 11","pages":"2619-2633"},"PeriodicalIF":3.6,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0