tubGEMM: Energy-Efficient and Sparsity-Effective Temporal-Unary-Binary Based Matrix Multiply Unit
Pub Date: 2023-06-20 | DOI: 10.1109/ISVLSI59464.2023.10238524
P. Vellaisamy, Harideep Nair, Joseph Finn, Manav Trivedi, Albert Chen, Anna Li, Tsung-Han Lin, Perry Wang, Shawn Blanton, John Paul Shen
General Matrix Multiplication (GEMM) is a ubiquitous compute kernel in deep learning (DL). To support energy-efficient edge-native processing, new GEMM hardware units have been proposed that operate on unary encoded bitstreams using much simpler hardware. Most unary approaches thus far focus on rate-based unary encoding of values and perform stochastic approximate computation. This work presents tubGEMM, a novel matrix-multiply unit design that employs hybrid temporal-unary and binary (tub) encoding and performs exact (not approximate) GEMM. It intrinsically exploits dynamic value sparsity to improve energy efficiency. Compared to uGEMM, the current best unary design, tubGEMM significantly reduces area, power, and energy by 89%, 87%, and 50%, respectively. A tubGEMM design performing a 128×128 matrix multiply on 8-bit integers, in a commercial TSMC N5 (5 nm) process node, consumes just 0.22 mm² of die area, 417.72 mW of power, and 8.86 µJ of energy, assuming no sparsity. Typical sparsity in DL workloads (MobileNetV2, ResNet50) reduces energy by more than 3×, and lowering precision to 4 and 2 bits further reduces it by 24× and 104×, respectively.
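The abstract does not detail the microarchitecture, but the core idea of temporal-unary times binary multiplication can be sketched: stream one operand as a run of pulses in time and accumulate the binary operand once per pulse, which is exact and does zero work for zero values. The Python sketch below is a conceptual model under those assumptions, not the paper's RTL; `tub_multiply` and `tub_gemm` are illustrative names.

```python
def tub_multiply(a, b):
    # temporal-unary x binary exact multiply: |a| temporal pulses, each
    # accumulating the binary operand b; zero values emit no pulses, so
    # latency and switching activity scale with the value (dynamic value
    # sparsity), and the result is exact rather than stochastic
    acc = 0
    for _ in range(abs(a)):
        acc += b
    return -acc if a < 0 else acc

def tub_gemm(A, B):
    # exact GEMM built from the unary multiplies (lists of lists of ints)
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(tub_multiply(A[i][t], B[t][j]) for t in range(k))
             for j in range(m)] for i in range(n)]

print(tub_gemm([[1, 2], [0, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [28, 32]]
```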
{"title":"tubGEMM: Energy-Efficient and Sparsity-Effective Temporal-Unary-Binary Based Matrix Multiply Unit","authors":"P. Vellaisamy, Harideep Nair, Joseph Finn, Manav Trivedi, Albert Chen, Anna Li, Tsung-Han Lin, Perry Wang, Shawn Blanton, John Paul Shen","doi":"10.1109/ISVLSI59464.2023.10238524","DOIUrl":"https://doi.org/10.1109/ISVLSI59464.2023.10238524","url":null,"abstract":"General Matrix Multiplication (GEMM) is a ubiquitous compute kernel in deep learning (DL). To support energy-efficient edge-native processing, new GEMM hardware units have been proposed that operate on unary encoded bitstreams using much simpler hardware. Most unary approaches thus far focus on rate-based unary encoding of values and perform stochastic approximate computation. This work presents tubGEMM, a novel matrix-multiply unit design that employs hybrid temporal-unary and binary (tub) encoding and performs exact (not approximate) GEMM. It intrinsically exploits dynamic value sparsity to improve energy efficiency. Compared to the current best unary design uGEMM, tubGEMM significantly reduces area, power, and energy by 89%, 87%, and 50% respectively. A tubGEMM design performing 128x128 matrix multiply on 8-bit integers, in commercial TSMC N5 (5nm) process node, consumes just 0.22 m$mathrm{m}^{2}$ die area, 417.72 mW power, and 8.86 $mu$J energy, assuming no sparsity. Typical sparsity in DL workloads (MobileNetv2, ResNet50) reduces energy by more than 3x, and lowering precision to 4 and 2 bits further reduces it by 24x and 104x respectively.","PeriodicalId":199371,"journal":{"name":"2023 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115689177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Design Space Exploration for CNN Offloading to FPGAs at the Edge
Pub Date: 2023-06-20 | DOI: 10.1109/ISVLSI59464.2023.10238644
Guilherme Korol, M. Jordan, M. B. Rutzig, J. Castrillón, A. C. S. Beck
AI-based IoT applications relying on compute-heavy deep learning algorithms like CNNs challenge IoT devices that are restricted in energy or processing capabilities. Edge computing offers an alternative by allowing data to be offloaded to so-called edge servers, with hardware more powerful than IoT devices and physically closer than the cloud. However, the increasing complexity of data and algorithms, along with diverse operating conditions, makes even powerful devices, such as those equipped with FPGAs, insufficient to cope with current demands. In this case, algorithmic optimizations like pruning and early-exit are mandatory to reduce the computational burden of CNNs and speed up inference. With that in mind, we propose ExpOL, which combines the pruning and early-exit CNN optimizations in a system-level FPGA-based IoT-Edge design-space exploration. Based on a user-defined multi-target optimization, ExpOL delivers designs tailored to specific application environments and user needs. When evaluated against state-of-the-art FPGA-based accelerators (either local or offloaded), designs produced by ExpOL are more power-efficient (by up to 2×) and process inferences at higher user quality of experience (by up to 12.5%).
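As a rough illustration of what a multi-target exploration over the two optimizations might look like, the sketch below scores each (pruning ratio, early-exit threshold) design point with a user-weighted objective. The `evaluate` cost model and the linear scoring form are assumptions for illustration, not ExpOL's actual algorithm.

```python
from itertools import product

def explore(prune_ratios, exit_thresholds, evaluate, weights):
    # exhaustive design-space exploration over pruning ratio and early-exit
    # threshold; evaluate() is a hypothetical user-supplied cost model that
    # returns (power_w, latency_s, accuracy) for one design point
    best_point, best_score = None, float("inf")
    for p, t in product(prune_ratios, exit_thresholds):
        power, latency, accuracy = evaluate(p, t)
        # user-defined multi-target objective (lower is better); the weight
        # vector expresses the user's power/latency/accuracy priorities
        score = weights[0] * power + weights[1] * latency - weights[2] * accuracy
        if score < best_score:
            best_point, best_score = (p, t), score
    return best_point, best_score

# toy usage with a made-up analytical model
best, _ = explore([0.0, 0.3, 0.5], [0.6, 0.8],
                  lambda p, t: (1.0 - 0.5 * p, 0.1 * (1 - p) * t, 0.9 - 0.2 * p),
                  weights=(1.0, 1.0, 1.0))
print(best)
```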
{"title":"Design Space Exploration for CNN Offloading to FPGAs at the Edge","authors":"Guilherme Korol, M. Jordan, M. B. Rutzig, J. Castrillón, A. C. S. Beck","doi":"10.1109/ISVLSI59464.2023.10238644","DOIUrl":"https://doi.org/10.1109/ISVLSI59464.2023.10238644","url":null,"abstract":"AI-based IoT applications relying on heavy-load deep learning algorithms like CNNs challenge IoT devices that are restricted in energy or processing capabilities. Edge computing offers an alternative by allowing the data to get offloaded to so-called edge servers with hardware more powerful than IoT devices and physically closer than the cloud. However, the increasing complexity of data and algorithms and diverse conditions make even powerful devices, such as those equipped with FPGAs, insufficient to cope with the current demands. In this case, optimizations in the algorithms, like pruning and early-exit, are mandatory to reduce the CNNs computational burden and speed up inference processing. With that in mind, we propose ExpOL, which combines the pruning and early-exit CNN optimizations in a system-level FPGA-based IoT-Edge design space exploration. Based on a user-defined multi-target optimization, ExpOL delivers designs tailored to specific application environments and user needs. When evaluated against state-of-the-art FPGA-based accelerators (either local or offloaded), designs produced by ExpOL are more power-efficient (by up to 2$times$) and process inferences at higher user quality of experience (by up to 12.5%).","PeriodicalId":199371,"journal":{"name":"2023 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127515441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dynamic Offloading for Improved Performance and Energy Efficiency in Heterogeneous IoT-Edge-Cloud Continuum
Pub Date: 2023-06-20 | DOI: 10.1109/ISVLSI59464.2023.10238564
J. Vicenzi, Guilherme Korol, M. Jordan, Wagner Ourique de Morais, Hazem Ali, Edison Pignaton De Freitas, M. B. Rutzig, A. C. S. Beck
While machine learning applications in IoT devices are becoming more widespread, the computational and power limitations of these devices pose a great challenge. To handle this increasing computational burden, edge and cloud solutions have emerged as a means to offload computation to more powerful devices. However, the unstable nature of network connections constantly changes communication costs, making the offloading process (i.e., when and where to transfer data) a dynamic trade-off. In this work, we propose DECOS: a framework that automatically selects at run time the offloading solution with minimum latency, based on the computational capabilities of the devices and the network status at a given moment. We use heterogeneous devices for edge and cloud nodes to evaluate the framework’s performance using the MobileNetV1 CNN and network traffic data from a real-world 4G bandwidth dataset. DECOS effectively selects the best processing node to maintain the minimum possible latency, reducing it by up to 29% compared to cloud-exclusive processing while reducing energy consumption by 1.9× compared to IoT-exclusive execution.
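The run-time decision the abstract describes reduces to picking the node that minimizes compute latency plus transfer latency under the currently measured bandwidth. A minimal sketch of that selection rule follows; the per-node profiles, names, and numbers are assumptions, not DECOS's internal model.

```python
def pick_node(profiles, data_bits, bandwidth_bps):
    # profiles: {node: (compute_latency_s, needs_offload)} -- assumed
    # per-node inference profiles; bandwidth_bps: current measured link
    # bandwidth per node (e.g., sampled from a 4G trace)
    best_node, best_lat = None, float("inf")
    for node, (compute_s, needs_offload) in profiles.items():
        transfer_s = data_bits / bandwidth_bps[node] if needs_offload else 0.0
        latency = compute_s + transfer_s
        if latency < best_lat:
            best_node, best_lat = node, latency
    return best_node, best_lat

# toy usage: a local IoT node versus edge/cloud offload targets
profiles = {"iot": (0.90, False), "edge": (0.12, True), "cloud": (0.05, True)}
bandwidth = {"iot": float("inf"), "edge": 8e6, "cloud": 2e6}
print(pick_node(profiles, data_bits=1.2e6, bandwidth_bps=bandwidth))  # edge wins
```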
{"title":"Dynamic Offloading for Improved Performance and Energy Efficiency in Heterogeneous IoT-Edge-Cloud Continuum","authors":"J. Vicenzi, Guilherme Korol, M. Jordan, Wagner Ourique de Morais, Hazem Ali, Edison Pignaton De Freitas, M. B. Rutzig, A. C. S. Beck","doi":"10.1109/ISVLSI59464.2023.10238564","DOIUrl":"https://doi.org/10.1109/ISVLSI59464.2023.10238564","url":null,"abstract":"While machine learning applications in IoT devices are getting more widespread, the computational and power limitations of these devices pose a great challenge. To handle this increasing computational burden, edge, and cloud solutions emerge as a means to offload computation to more powerful devices. However, the unstable nature of network connections constantly changes the communication costs, making the offload process (i.e., when and where to transfer data) a dynamic trade-off. In this work, we propose DECOS: a framework to automatically select at run-time the best offloading solution with minimum latency based on the computational capabilities of devices and network status at a given moment. We use heterogeneous devices for edge and Cloud nodes to evaluate the framework’s performance using MobileNetV1 CNN and network traffic data from a real-world 4G bandwidth dataset. DECOS effectively selects the best processing node to maintain the minimum possible latency, reducing it up to 29% compared to Cloud-exclusive processing while reducing the energy consumption by 1.9$times$ compared to IoT-exclusive execution.","PeriodicalId":199371,"journal":{"name":"2023 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121330880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Photonic Convolution Engine Based on Phase-Change Materials and Stochastic Computing
Pub Date: 2023-06-20 | DOI: 10.1109/ISVLSI59464.2023.10238608
Raphael Cardoso, Clément Zrounba, M.F. Abdalla, Paul Jiménez, Mauricio Gomes de Queiroz, Benoît Charbonnier, Fabio Pavanello, Ian O’Connor, S. L. Beux
The last wave of AI developments sparked a global surge in computing resources allocated to neural network models. Even though such models solve complex problems, their mathematical foundations are simple, with the multiply-accumulate (MAC) operation standing out as one of the most important. However, improvements in traditional CMOS technologies fail to match the ever-increasing performance requirements of AI applications; therefore, new technologies, as well as disruptive computing architectures, must be explored. In this paper, we propose a novel in-memory implementation of a MAC operator based on stochastic computing and optical phase-change memories (oPCMs), leveraging their proven non-volatility and multi-level capabilities to achieve convolution. We show that resorting to the stochastic computing paradigm allows one to exploit the dynamic mechanisms of oPCMs to naturally compute and store MAC results with less noise sensitivity. Under similar conditions, we demonstrate improvements of up to 64× and 10× in the applications that we evaluated.
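The stochastic computing paradigm the paper builds on can be illustrated independently of the optical device: values in [0, 1] become Bernoulli bitstreams, and a bitwise AND yields a stream whose ones-density approximates the product. A minimal NumPy sketch follows; the oPCM storage and its dynamics are not modeled.

```python
import numpy as np

rng = np.random.default_rng(42)

def sc_multiply(p1, p2, n_bits=4096):
    # stochastic-computing multiply: each value is encoded as a Bernoulli
    # bitstream whose ones-probability equals the value; ANDing two
    # independent streams gives a stream with ones-probability p1*p2
    s1 = rng.random(n_bits) < p1
    s2 = rng.random(n_bits) < p2
    return np.mean(s1 & s2)

print(sc_multiply(0.5, 0.8))  # ~0.4, with stochastic error ~1/sqrt(n_bits)
```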
{"title":"Photonic Convolution Engine Based on Phase-Change Materials and Stochastic Computing","authors":"Raphael Cardoso, Clément Zrounba, M.F. Abdalla, Paul Jiménez, Mauricio Gomes de Queiroz, Benoît Charbonnier, Fabio Pavanello, Ian O’Connor, S. L. Beux","doi":"10.1109/ISVLSI59464.2023.10238608","DOIUrl":"https://doi.org/10.1109/ISVLSI59464.2023.10238608","url":null,"abstract":"The last wave of AI developments sparked a global surge in computing resources allocated to neural network models. Even though such models solve complex problems, their mathematical foundations are simple, with the multiply-accumulate (MAC) operation standing out as one of the most important. However, improvements in traditional CMOS technologies fail to match the ever-increasing performance requirements of AI applications, therefore new technologies, as well as disruptive computing architectures must be explored. In this paper, we propose a novel in-memory implementation of a MAC operator based on stochastic computing and optical phase-change memories (oPCMs), leveraging their proven non-volatility and multi-level capabilities to achieve convolution. We show that resorting to the stochastic computing paradigm allows one to exploit the dynamic mechanisms of oPCMs to naturally compute and store MAC results with less noise sensitivity. Under similar conditions, we demonstrate an improvement of up to $64times$ and $10times$ in the applications that we evaluated.","PeriodicalId":199371,"journal":{"name":"2023 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130447024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Digital SRAM Computing-in-Memory Design Utilizing Activation Unstructured Sparsity for High-Efficient DNN Inference
Pub Date: 2023-06-20 | DOI: 10.1109/ISVLSI59464.2023.10238597
Baiqing Zhong, Mingyu Wang, Chuanghao Zhang, Yangzhan Mai, Xiaojie Li, Zhiyi Yu
The Computing-in-Memory (CIM) architecture has emerged as a promising approach for designing energy-efficient DNN processors. While previous CIM designs have explored the use of DNN weight sparsity, these approaches often involve pruning the weight matrix in a specific manner, which can add computational complexity and negatively impact DNN accuracy. Moreover, there are barely any digital CIM circuits that leverage activation sparsity, even though activations are naturally sparse in many scenarios due to ReLU activation functions. To fully utilize unstructured activation sparsity, we propose a digital SRAM CIM design. The circuit uses a Booth encoding scheme and adopts the structure of an accumulator-based multiply-accumulate (MAC) unit. It utilizes SRAM bit-line (BL) computing to obtain matrix sparsity information and employs an allocator to assign data computations within the SRAM-CIM. The proposed design is implemented and evaluated in a 40 nm CMOS process. Our evaluation results show that the proposed circuit achieves a clock frequency of 1 GHz at 1.1 V with a peak performance of 819.2 GOPS; at 50%-90% sparsity, the SRAM-CIM achieves a 1.12× to 3.32× speedup and energy savings of 48.2% to 90.57% over dense mode. When performing an 8-bit matrix multiplication with 90% sparsity, the energy efficiency is 10.57 TOPS/W.
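The gain from unstructured activation sparsity can be seen in a simple functional model: every zero activation is detected and skipped before any MAC work is spent on it. The sketch below is conceptual only; the paper's bit-line zero detection and Booth-encoded datapath are not modeled.

```python
def sparse_mac(activations, weights):
    # accumulator-based MAC that skips zero activations; in the paper the
    # zero check is done in hardware via SRAM bit-line computing, and an
    # allocator steers the remaining nonzero work
    acc, skipped = 0, 0
    for a, w in zip(activations, weights):
        if a == 0:        # unstructured sparsity: no pattern assumed
            skipped += 1  # this element costs no MAC energy/cycles
            continue
        acc += a * w
    return acc, skipped

# post-ReLU activation vectors are often mostly zero
print(sparse_mac([0, 3, 0, 0, 7, 0, 0, 1], [5, -2, 4, 1, 3, 9, -8, 2]))  # (17, 5)
```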
{"title":"A Digital SRAM Computing-in-Memory Design Utilizing Activation Unstructured Sparsity for High-Efficient DNN Inference","authors":"Baiqing Zhong, Mingyu Wang, Chuanghao Zhang, Yangzhan Mai, Xiaojie Li, Zhiyi Yu","doi":"10.1109/ISVLSI59464.2023.10238597","DOIUrl":"https://doi.org/10.1109/ISVLSI59464.2023.10238597","url":null,"abstract":"The Computing-in-Memory (CIM) architecture has emerged as a promising approach for designing energy-efficient DNN processors. While previous CIM designs have explored the use of DNN weight sparsity, these approaches often involve pruning the weight matrix in a specific manner. This process may increase the new complexity of the calculation and negatively impact DNN accuracy. However, there are barely any digital CIM circuits that leverage the sparsity in activation which is naturally sparse in many scenarios due to the ReLU activation functions. In order to fully utilize activation unstructured sparsity, we proposed a digital SRAM CIM. This circuit is designed using the booth encoding scheme and adopts the circuit structure of an accumulator-based multiply-accumulate (MAC) calculation. It utilizes SRAM bit-line (BL) computing to obtain matrix sparse information and employs an allocator to allocate data calculation for SRAM-CIM. The proposed design is implemented and evaluated at 40 nm CMOS process. Our evaluation results show that the proposed circuit can achieve a clock frequency of 1 GHz at 1.1 V, with a peak performance of 819.2 GOPS, and in the case of 50%-90% sparsity, SRAM-CIM achieves $1.12 times 3.32 times$ speedup, and energy savings of 48.2% to 90.57% over dense mode. When performing an 8-bit matrix multiplication with 90% sparsity, the energy efficiency is 10.57 TOPS/W.","PeriodicalId":199371,"journal":{"name":"2023 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"105 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134503681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating an XOR-based Hybrid Fault Tolerance Technique to Detect Faults in GPU Pipelines
Pub Date: 2023-06-20 | DOI: 10.1109/ISVLSI59464.2023.10238657
Giani Braga, Marcio M. Gonçalves, J. Azambuja
Graphics Processing Units are consistently reaching new application domains thanks to their massively parallel execution architectures. However, some safety-critical areas, such as avionics, expose them to harsh environments where radiation effects caused by cosmic rays can lead to component failures. This work implements and tests a hybrid fault tolerance technique, initially proposed by NVIDIA, to protect a GPU’s pipeline against radiation effects. Results show that the technique can be effective against data-flow errors, but at a high cost in execution-time overhead and with potentially increased control-flow errors.
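The general duplicate-and-compare idea behind XOR-based data-flow error detection can be sketched in software: execute redundantly and XOR the results, so any nonzero signature flags a transient fault. This is a conceptual model only, not NVIDIA's pipeline instrumentation.

```python
def xor_check(op, x, y):
    # run the operation twice and XOR the integer results; a nonzero
    # signature indicates the two copies disagreed (a data-flow fault)
    r1 = op(x, y)
    r2 = op(x, y)  # redundant execution (hardware: a shadow pipeline copy)
    if r1 ^ r2:
        raise RuntimeError("data-flow fault detected")
    return r1

print(xor_check(lambda a, b: a + b, 40, 2))  # 42 when both copies agree
```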
{"title":"Evaluating an XOR-based Hybrid Fault Tolerance Technique to Detect Faults in GPU Pipelines","authors":"Giani Braga, Marcio M. Gonçalves, J. Azambuja","doi":"10.1109/ISVLSI59464.2023.10238657","DOIUrl":"https://doi.org/10.1109/ISVLSI59464.2023.10238657","url":null,"abstract":"Graphics Processing Units are consistently reaching new applications due to their massive parallel execution architectures. However, some safety-critical areas, such as avionics, come with unfriendly environments due to radiation effects caused by cosmic rays, effectively causing component failures. This work implements and tests a hybrid fault tolerance technique initially proposed by NVIDIA to protect a GPU’s pipeline against radiation effects. Results show that the technique can be effective against data-flow errors but at a high cost in execution time overheads and potentially increased control-flow errors.","PeriodicalId":199371,"journal":{"name":"2023 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132248338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Design and Evaluation of M-Term Non-Homogeneous Hybrid Karatsuba Polynomial Multiplier
Pub Date: 2023-06-20 | DOI: 10.1109/ISVLSI59464.2023.10238681
Sanampudi Gopala Krishna Reddy, Gogireddy Ravi Kiran Reddy, Vasanthi D R, Madhav Rao
Finite-field multipliers play an increasingly crucial role in modern cryptography systems. While much attention has been given to the development of area-efficient Karatsuba multipliers as a means of bolstering encryption capabilities, a vast and untapped design space remains to be explored. An innovative technique that has emerged in this area is the Composite M-Term Karatsuba-like Multiplier, which integrates a schoolbook multiplier (SBM) at the lower bounds to enhance performance. However, breaking down operand bit-widths homogeneously across the recursion stages may not yield optimal hardware characteristics; further improvement can be achieved by configuring the recursive stages with non-homogeneous ‘M’ values. This paper performs an exhaustive design-space exploration of Karatsuba-like multipliers for different bit-widths and presents a methodology for designing the possible sequences of an M-term non-homogeneous hybrid Karatsuba multiplier (MNHKA). A few MNHKA designs among the many sequences achieve high performance while minimizing area requirements. This study evaluates the area, delay, and area-delay product (ADP) characteristics of the pure M-term Karatsuba multiplier (MKA), the composite M-term Karatsuba with SBM (CMKA), and the novel MNHKA, each configured as a finite-field multiplier for several popular bit-widths. In addition, the study introduces a MATLAB-based framework that generates optimized hardware design code for MNHKA designs with customizable sequences and operand sizes. The proposed MNHKA design was implemented and verified on a Zynq ZCU104 FPGA board and also synthesized with a 45 nm technology library using the Cadence Genus tool. The FPGA results, in terms of LUT utilization and delay, clearly indicate that the proposed MNHKA polynomial multiplier outperforms state-of-the-art (SOTA) designs across bit-widths. Specifically, the proposed design achieves an ADP improvement of 12.33% for a bit-width of 64, and greater gains of 21.15%, 27.74%, and 23.045% for higher-order bit-widths of 409, 1350, and 2500, respectively, compared to the CMKA (SOTA) multiplier. The ASIC flow results show an impressive maximum footprint saving of 47.61% and a significant ADP gain of 45.72% for the 1350-bit design, along with ADP improvements of 16.42%, 15.56%, and 22.59% for bit-widths of 64, 409, and 2500, respectively, compared to the CMKA design. All the designs are made freely available for adoption by the research and design community.
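For reference, the standard 2-term (M=2) recursive Karatsuba over GF(2)[x] with a schoolbook base case can be sketched in a few lines of Python; the paper's M-term non-homogeneous splitting generalizes this recursion by varying the number of terms per stage. The threshold below is an illustrative assumption, not one of the paper's chosen sequences.

```python
def clmul(a, b):
    # schoolbook carry-less multiply over GF(2)[x] (the SBM base case);
    # polynomials are encoded as Python ints and addition is XOR
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def karatsuba_gf2(a, b, n, threshold=8):
    # classic 2-term Karatsuba recursion on n-bit operands; MNHKA would
    # instead pick a (possibly different) M at every recursion stage
    if n <= threshold:
        return clmul(a, b)
    h = n // 2
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h
    b0, b1 = b & mask, b >> h
    p0 = karatsuba_gf2(a0, b0, h, threshold)
    p2 = karatsuba_gf2(a1, b1, n - h, threshold)
    p1 = karatsuba_gf2(a0 ^ a1, b0 ^ b1, n - h, threshold)
    # over GF(2), subtraction equals XOR: middle term = p1 - p0 - p2
    return (p2 << (2 * h)) ^ ((p0 ^ p1 ^ p2) << h) ^ p0

# sanity check against the schoolbook multiplier
assert karatsuba_gf2(0b10011011, 0b11010101, 8, threshold=2) == clmul(0b10011011, 0b11010101)
```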
{"title":"Design and Evaluation of M-Term Non-Homogeneous Hybrid Karatsuba Polynomial Multiplier","authors":"Sanampudi Gopala Krishna Reddy, Gogireddy Ravi Kiran Reddy, Vasanthi D R, Madhav Rao","doi":"10.1109/ISVLSI59464.2023.10238681","DOIUrl":"https://doi.org/10.1109/ISVLSI59464.2023.10238681","url":null,"abstract":"Finite-field multipliers progressively plays a crucial role in modern cryptography systems. While much attention has been given to the development of area-efficient Karatsuba multipliers as a means of bolstering encryption capabilities, there remains a vast and untapped realm of design space yet to be explored. An innovative technique that has emerged in this area involves the implementation of a Composite M-Term Karatsuba-like Multiplier, which integrates a schoolbook multiplier (SBM) at lower bounds to enhance performance. However, the approach of breaking down operand bit-widths homogeneously along the stages may not result in optimal hardware characteristics, and further improvement can be achieved by configuring the recursive stages to non-homogeneous ‘M’ values. This paper attempts to perform an exhaustive design-space exploration of Karatsuba-like multipliers for different bit-widths and presents a methodology for designing different possible sequences for M-Term non-homogeneous hybrid Karatsuba multiplier (MNHKA). Few MNHKA designs among several sequences achieve high performance while minimizing area requirements. This study evaluates the area, delay, and area-delay-product (ADP) characteristics of pure M-Term Karatsuba multiplier (MKA), Composite M-Term Karatsuba with SBM (CMKA), and a novel MNHKA that are configured as finite field multipliers for different popular bitwidths. In addition, this study also introduces a novel Matlab-based framework that enables the generation of an optimized hardware design code for MNHKA design with customizable sequence and operand sizes. The proposed MNHKA design was implemented and verified on ZYNQ ZCU-104 FPGA Board and also synthesized using 45 nm technology library on Cadence-Genus tool. The implemented FPGA results with LUTs utilization and delay metrics clearly indicate that the proposed category of MNHKA polynomial multiplier outperforms SOTA designs for various bit-widths. Specifically, the proposed design achieves an ADP improvement of 12.33% for a bit-width of 64, and greater gains of 21.15%, 27.74%, and 23.045% for higher order bits of 409, 1350, and 2500, respectively, when compared to CMKA(STOA) multiplier. The experimental results of ASIC flow resulted in an impressive maximum footprint saving of 47.61% as well as significant ADP gains of 45.72% for the 1350-bit design, and also achieved ADP improvement of 16.42%, 15.56%, and 22.59% for bit widths of 64, 409, and 2500, respectively, when compared to CMKA design. 
All the designs are made freely available for further adoption to the researchers and the designers community.","PeriodicalId":199371,"journal":{"name":"2023 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122219085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Machine Learning Techniques for Pre-CTS Identification of Timing Critical Flip-Flops
Pub Date: 2023-06-20 | DOI: 10.1109/ISVLSI59464.2023.10238658
Chunkai Fu, Ben Trombley, Hua Xiang, Gi-Joon Nam, Jiang Hu
The timing criticality of flip-flops is a key factor in combinational-circuit timing optimization and clock-network power reduction, both of which are often performed prior to clock tree synthesis (CTS) and routing. However, timing criticality is often changed by CTS and routing, so optimizations guided by pre-CTS criticality may deviate from the correct directions. This work investigates machine learning techniques for pre-CTS identification of post-routing timing-critical flip-flops. Experimental results show that ML-based early identification can achieve 99.7% accuracy and an area of 0.98 under the ROC (Receiver Operating Characteristic) curve, and is on average 62,000× to 73,000× faster than estimation via the full CTS and routing flow. Our method is also almost 8× faster than a state-of-the-art approach to ML-based timing prediction.
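The general workflow (train a classifier on pre-CTS flip-flop features against post-routing criticality labels, score it with ROC AUC) can be sketched with an off-the-shelf model. The features below are synthetic placeholders; the paper's actual feature set and model are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic stand-in for pre-CTS features (e.g., slack, fanout, distance
# to the clock root) and binary post-routing criticality labels
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 8))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=5000)) > 1.0

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
# area under the ROC curve, the same quality metric reported in the paper
print("ROC AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```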
{"title":"Machine Learning Techniques for Pre-CTS Identification of Timing Critical Flip-Flops","authors":"Chunkai Fu, Ben Trombley, Hua Xiang, Gi-Joon Nam, Jiang Hu","doi":"10.1109/ISVLSI59464.2023.10238658","DOIUrl":"https://doi.org/10.1109/ISVLSI59464.2023.10238658","url":null,"abstract":"-The timing criticality of flip-flops is a key factor for combinational circuit timing optimization and clock network power reduction, both of which are often performed prior to CTS (Clock Tree Synthesis) and routing. However, timing criticality is often changed by CTS/routing and therefore optimizations according to pre-CTS criticality may deviate from the correct directions. This work investigates machine learning techniques for pre-CTS identification of post-routing timing critical flip-flops. Experimental results show that the ML-based early identification can achieve 99.7% accuracy and 0.98 area under ROC (Receiver Operating Characteristic) curve, and is $62000 times$ to $73000 times$ faster than the estimate with CTS and routing flow on average. Our method is almost $8 times$ faster than a state-of-the-art approach of ML-based timing prediction.","PeriodicalId":199371,"journal":{"name":"2023 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122131705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LEX - A Cell Switching Arcs Extractor: A Simple SPICE-Input Interface for Electrical Characterization
Pub Date: 2023-06-20 | DOI: 10.1109/ISVLSI59464.2023.10238671
Rodrigo N. Wuerdig, V. H. Maciel, Ricardo Reis, S. Bampi
The characterization of logic cells is a critical step in the design of digital circuits. Existing open-source cell-characterization tools typically require significant extra information beyond the SPICE netlist. In this paper, we present a new open-source tool, LEX, that serves as a useful front-end for these characterization tools, extracting essential input and output information, Boolean expressions, truth tables, and transition (switching) arcs directly from the SPICE netlist. LEX offers several advantages over existing open-source methods. First, it simplifies the cell electrical-characterization process by eliminating the need for manual input of additional information, which saves time and reduces errors. Second, it provides a more comprehensive set of information than existing tools, including Boolean expressions and truth tables. Third, LEX is highly flexible and can be integrated with a wide range of existing open-source cell-characterization tools. We conducted experiments on a test set of netlists to demonstrate LEX’s effectiveness. By providing a more comprehensive set of information, eliminating manual input, and improving efficiency, our tool offers a powerful new option for integration into existing and future open-source characterization tools.
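The notion of a switching (transition) arc can be illustrated from a truth table alone: an arc is a single-input toggle that flips the cell output. The sketch below assumes the cell's Boolean function is already known; LEX's actual contribution is recovering that function from the SPICE netlist, which is not modeled here.

```python
from itertools import product

def truth_table_and_arcs(fn, n_inputs):
    # enumerate the truth table of an n-input cell, then list every
    # (input index, starting vector, output direction) whose single-input
    # toggle changes the output -- i.e., the cell's switching arcs
    table = {v: fn(*v) for v in product((0, 1), repeat=n_inputs)}
    arcs = []
    for v, out in table.items():
        for i in range(n_inputs):
            w = list(v)
            w[i] ^= 1
            if table[tuple(w)] != out:
                arcs.append((i, v, "rise" if out == 0 else "fall"))
    return table, arcs

# example: a 2-input NAND cell
table, arcs = truth_table_and_arcs(lambda a, b: 1 - (a & b), 2)
print(arcs)
```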
{"title":"LEX - A Cell Switching Arcs Extractor: A Simple SPICE-Input Interface for Electrical Characterization","authors":"Rodrigo N. Wuerdig, V. H. Maciel, Ricardo Reis, S. Bampi","doi":"10.1109/ISVLSI59464.2023.10238671","DOIUrl":"https://doi.org/10.1109/ISVLSI59464.2023.10238671","url":null,"abstract":"The characterization of logic cells is a critical step in the design of digital circuits. Existing open-source cell characterization tools typically require significant extra information beyond the SPICE netlist. In this paper, we present a new open-source tool - LEX - that serves as a very useful interface for these characterization tools, enabling the extraction of essential input and output information, Boolean expressions, truth tables, and transition (switching) arcs directly from the SPICE netlist. Our LEX tool offers several advantages over existing open-source methods. First, it simplifies the cell electrical characterization process by eliminating the need for manual input of additional information. This saves time and reduces the incidence of errors. Second, our tool provides a more comprehensive set of information than existing tools, including Boolean expressions and truth tables. Third, LEX is highly flexible and can be integrated with a wide range of existing open-source cell characterization tools. We conducted experiments using a test set of netlists to demonstrate LEX effectiveness. By providing a more comprehensive set of information, eliminating the need for manual input of additional information, and improving efficiency, our tool offers a powerful new option to be integrated into already existing and future open-source characterization tools.","PeriodicalId":199371,"journal":{"name":"2023 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"354 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126686826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploiting Routing Asymmetry for APUF Implementation in FPGA: A Proof-of-Concept
Pub Date: 2023-06-20 | DOI: 10.1109/ISVLSI59464.2023.10238578
Trishna Rajkumar
Implementing an arbiter PUF (APUF) in an FPGA requires identical logic and symmetrical routing to ensure that delay differences are due to process variations. As FPGA routing tools optimise for performance rather than symmetry, the FPGA CAD flow requires interventions such as manual routing and the use of hard macros. These measures require a designer to work at a lower level of abstraction than RTL, which can be tedious and error-prone. Furthermore, they require extensive knowledge of the FPGA fabric, which may not be available owing to its proprietary nature. Considering these challenges, we investigate the possibility of an arbiter PUF implementation within the standard FPGA CAD flow by leveraging the routing asymmetry instead of eliminating it. Preliminary characterisation of a proof-of-concept APUF model demonstrated a uniformity of 49.4% and a reliability of 96.3%.
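The uniformity figure can be related to the standard additive delay model of an arbiter PUF, in which the response is the sign of a per-stage delay difference accumulated along the challenge-selected paths. The Gaussian stage delays below are an assumption for illustration, not measured FPGA data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_stages, n_chal = 64, 10000
# per-stage delay differences for the straight and crossed switch settings,
# drawn from a Gaussian to model process variation (assumed parameters)
d_straight = rng.normal(0.0, 1.0, n_stages)
d_crossed = rng.normal(0.0, 1.0, n_stages)
C = rng.integers(0, 2, (n_chal, n_stages))  # random challenges

delta = np.zeros(n_chal)  # accumulated top-vs-bottom delay difference
sign = np.ones(n_chal)    # flips whenever a crossed stage swaps the paths
for i in range(n_stages):
    d = np.where(C[:, i] == 0, d_straight[i], d_crossed[i])
    delta += sign * d
    sign *= np.where(C[:, i] == 0, 1.0, -1.0)

resp = (delta > 0).astype(int)       # arbiter decides which path won
print("uniformity:", resp.mean())    # ~0.5 for an ideal APUF
```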
{"title":"Exploiting Routing Asymmetry for APUF Implementation in FPGA: A Proof-of-Concept","authors":"Trishna Rajkumar","doi":"10.1109/ISVLSI59464.2023.10238578","DOIUrl":"https://doi.org/10.1109/ISVLSI59464.2023.10238578","url":null,"abstract":"Implementing Arbiter PUF in an FPGA requires identical logic and symmetrical routing to ensure the delay differences are due to process variations. As the FPGA routing tools optimise for performance and not for symmetry, the FPGA CAD flow requires interventions like manual routing and the use of hard macros. These measures require a designer to work at a lower level of abstraction than RTL which can be tedious and error prone. Furthermore, they require an extensive knowledge of the FPGA fabric which may not be available owing to their proprietary nature. Considering these challenges, we investigate the possibility of an arbiter PUF implementation within the FPGA CAD flow by leveraging the routing asymmetry instead of eliminating it. Preliminary characterisation of a proof of concept APUF model demonstrated uniformity of 49.4 % and reliability of 96.3 %.","PeriodicalId":199371,"journal":{"name":"2023 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131536082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}