tubGEMM: Energy-Efficient and Sparsity-Effective Temporal-Unary-Binary Based Matrix Multiply Unit
Pub Date: 2023-06-20 | DOI: 10.1109/ISVLSI59464.2023.10238524
P. Vellaisamy, Harideep Nair, Joseph Finn, Manav Trivedi, Albert Chen, Anna Li, Tsung-Han Lin, Perry Wang, Shawn Blanton, John Paul Shen
General Matrix Multiplication (GEMM) is a ubiquitous compute kernel in deep learning (DL). To support energy-efficient edge-native processing, new GEMM hardware units have been proposed that operate on unary encoded bitstreams using much simpler hardware. Most unary approaches thus far focus on rate-based unary encoding of values and perform stochastic approximate computation. This work presents tubGEMM, a novel matrix-multiply unit design that employs hybrid temporal-unary and binary (tub) encoding and performs exact (not approximate) GEMM. It intrinsically exploits dynamic value sparsity to improve energy efficiency. Compared to uGEMM, the current best unary design, tubGEMM significantly reduces area, power, and energy by 89%, 87%, and 50%, respectively. A tubGEMM design performing a 128×128 matrix multiply on 8-bit integers, in a commercial TSMC N5 (5 nm) process node, consumes just 0.22 mm² of die area, 417.72 mW of power, and 8.86 µJ of energy, assuming no sparsity. Typical sparsity in DL workloads (MobileNetV2, ResNet50) reduces energy by more than 3×, and lowering precision to 4 and 2 bits further reduces it by 24× and 104×, respectively.
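The abstract does not detail the microarchitecture, but the core idea of temporal-unary times binary multiplication can be sketched: stream one operand as a run of pulses in time and accumulate the binary operand once per pulse, which is exact and does zero work for zero values. The Python sketch below is a conceptual model under those assumptions, not the paper's RTL; `tub_multiply` and `tub_gemm` are illustrative names.

```python
def tub_multiply(a, b):
    # temporal-unary x binary exact multiply: |a| temporal pulses, each
    # accumulating the binary operand b; zero values emit no pulses, so
    # latency and switching activity scale with the value (dynamic value
    # sparsity), and the result is exact rather than stochastic
    acc = 0
    for _ in range(abs(a)):
        acc += b
    return -acc if a < 0 else acc

def tub_gemm(A, B):
    # exact GEMM built from the unary multiplies (lists of lists of ints)
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(tub_multiply(A[i][t], B[t][j]) for t in range(k))
             for j in range(m)] for i in range(n)]

print(tub_gemm([[1, 2], [0, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [28, 32]]
```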
{"title":"tubGEMM: Energy-Efficient and Sparsity-Effective Temporal-Unary-Binary Based Matrix Multiply Unit","authors":"P. Vellaisamy, Harideep Nair, Joseph Finn, Manav Trivedi, Albert Chen, Anna Li, Tsung-Han Lin, Perry Wang, Shawn Blanton, John Paul Shen","doi":"10.1109/ISVLSI59464.2023.10238524","DOIUrl":"https://doi.org/10.1109/ISVLSI59464.2023.10238524","url":null,"abstract":"General Matrix Multiplication (GEMM) is a ubiquitous compute kernel in deep learning (DL). To support energy-efficient edge-native processing, new GEMM hardware units have been proposed that operate on unary encoded bitstreams using much simpler hardware. Most unary approaches thus far focus on rate-based unary encoding of values and perform stochastic approximate computation. This work presents tubGEMM, a novel matrix-multiply unit design that employs hybrid temporal-unary and binary (tub) encoding and performs exact (not approximate) GEMM. It intrinsically exploits dynamic value sparsity to improve energy efficiency. Compared to the current best unary design uGEMM, tubGEMM significantly reduces area, power, and energy by 89%, 87%, and 50% respectively. A tubGEMM design performing 128x128 matrix multiply on 8-bit integers, in commercial TSMC N5 (5nm) process node, consumes just 0.22 m$mathrm{m}^{2}$ die area, 417.72 mW power, and 8.86 $mu$J energy, assuming no sparsity. Typical sparsity in DL workloads (MobileNetv2, ResNet50) reduces energy by more than 3x, and lowering precision to 4 and 2 bits further reduces it by 24x and 104x respectively.","PeriodicalId":199371,"journal":{"name":"2023 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115689177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Design Space Exploration for CNN Offloading to FPGAs at the Edge
Pub Date: 2023-06-20 | DOI: 10.1109/ISVLSI59464.2023.10238644
Guilherme Korol, M. Jordan, M. B. Rutzig, J. Castrillón, A. C. S. Beck
AI-based IoT applications relying on compute-heavy deep learning algorithms like CNNs challenge IoT devices that are restricted in energy or processing capabilities. Edge computing offers an alternative by allowing data to be offloaded to so-called edge servers, with hardware more powerful than IoT devices and physically closer than the cloud. However, the increasing complexity of data and algorithms, along with diverse operating conditions, makes even powerful devices, such as those equipped with FPGAs, insufficient to cope with current demands. In this case, algorithmic optimizations like pruning and early-exit are mandatory to reduce the computational burden of CNNs and speed up inference. With that in mind, we propose ExpOL, which combines the pruning and early-exit CNN optimizations in a system-level FPGA-based IoT-Edge design-space exploration. Based on a user-defined multi-target optimization, ExpOL delivers designs tailored to specific application environments and user needs. When evaluated against state-of-the-art FPGA-based accelerators (either local or offloaded), designs produced by ExpOL are more power-efficient (by up to 2×) and process inferences at higher user quality of experience (by up to 12.5%).
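As a rough illustration of what a multi-target exploration over the two optimizations might look like, the sketch below scores each (pruning ratio, early-exit threshold) design point with a user-weighted objective. The `evaluate` cost model and the linear scoring form are assumptions for illustration, not ExpOL's actual algorithm.

```python
from itertools import product

def explore(prune_ratios, exit_thresholds, evaluate, weights):
    # exhaustive design-space exploration over pruning ratio and early-exit
    # threshold; evaluate() is a hypothetical user-supplied cost model that
    # returns (power_w, latency_s, accuracy) for one design point
    best_point, best_score = None, float("inf")
    for p, t in product(prune_ratios, exit_thresholds):
        power, latency, accuracy = evaluate(p, t)
        # user-defined multi-target objective (lower is better); the weight
        # vector expresses the user's power/latency/accuracy priorities
        score = weights[0] * power + weights[1] * latency - weights[2] * accuracy
        if score < best_score:
            best_point, best_score = (p, t), score
    return best_point, best_score

# toy usage with a made-up analytical model
best, _ = explore([0.0, 0.3, 0.5], [0.6, 0.8],
                  lambda p, t: (1.0 - 0.5 * p, 0.1 * (1 - p) * t, 0.9 - 0.2 * p),
                  weights=(1.0, 1.0, 1.0))
print(best)
```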
{"title":"Design Space Exploration for CNN Offloading to FPGAs at the Edge","authors":"Guilherme Korol, M. Jordan, M. B. Rutzig, J. Castrillón, A. C. S. Beck","doi":"10.1109/ISVLSI59464.2023.10238644","DOIUrl":"https://doi.org/10.1109/ISVLSI59464.2023.10238644","url":null,"abstract":"AI-based IoT applications relying on heavy-load deep learning algorithms like CNNs challenge IoT devices that are restricted in energy or processing capabilities. Edge computing offers an alternative by allowing the data to get offloaded to so-called edge servers with hardware more powerful than IoT devices and physically closer than the cloud. However, the increasing complexity of data and algorithms and diverse conditions make even powerful devices, such as those equipped with FPGAs, insufficient to cope with the current demands. In this case, optimizations in the algorithms, like pruning and early-exit, are mandatory to reduce the CNNs computational burden and speed up inference processing. With that in mind, we propose ExpOL, which combines the pruning and early-exit CNN optimizations in a system-level FPGA-based IoT-Edge design space exploration. Based on a user-defined multi-target optimization, ExpOL delivers designs tailored to specific application environments and user needs. When evaluated against state-of-the-art FPGA-based accelerators (either local or offloaded), designs produced by ExpOL are more power-efficient (by up to 2$times$) and process inferences at higher user quality of experience (by up to 12.5%).","PeriodicalId":199371,"journal":{"name":"2023 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127515441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dynamic Offloading for Improved Performance and Energy Efficiency in Heterogeneous IoT-Edge-Cloud Continuum
Pub Date: 2023-06-20 | DOI: 10.1109/ISVLSI59464.2023.10238564
J. Vicenzi, Guilherme Korol, M. Jordan, Wagner Ourique de Morais, Hazem Ali, Edison Pignaton De Freitas, M. B. Rutzig, A. C. S. Beck
While machine learning applications in IoT devices are becoming more widespread, the computational and power limitations of these devices pose a great challenge. To handle this increasing computational burden, edge and cloud solutions have emerged as a means to offload computation to more powerful devices. However, the unstable nature of network connections constantly changes communication costs, making the offloading process (i.e., when and where to transfer data) a dynamic trade-off. In this work, we propose DECOS: a framework that automatically selects at run time the offloading solution with minimum latency, based on the computational capabilities of the devices and the network status at a given moment. We use heterogeneous devices for edge and cloud nodes to evaluate the framework’s performance using the MobileNetV1 CNN and network traffic data from a real-world 4G bandwidth dataset. DECOS effectively selects the best processing node to maintain the minimum possible latency, reducing it by up to 29% compared to cloud-exclusive processing while reducing energy consumption by 1.9× compared to IoT-exclusive execution.
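The run-time decision the abstract describes reduces to picking the node that minimizes compute latency plus transfer latency under the currently measured bandwidth. A minimal sketch of that selection rule follows; the per-node profiles, names, and numbers are assumptions, not DECOS's internal model.

```python
def pick_node(profiles, data_bits, bandwidth_bps):
    # profiles: {node: (compute_latency_s, needs_offload)} -- assumed
    # per-node inference profiles; bandwidth_bps: current measured link
    # bandwidth per node (e.g., sampled from a 4G trace)
    best_node, best_lat = None, float("inf")
    for node, (compute_s, needs_offload) in profiles.items():
        transfer_s = data_bits / bandwidth_bps[node] if needs_offload else 0.0
        latency = compute_s + transfer_s
        if latency < best_lat:
            best_node, best_lat = node, latency
    return best_node, best_lat

# toy usage: a local IoT node versus edge/cloud offload targets
profiles = {"iot": (0.90, False), "edge": (0.12, True), "cloud": (0.05, True)}
bandwidth = {"iot": float("inf"), "edge": 8e6, "cloud": 2e6}
print(pick_node(profiles, data_bits=1.2e6, bandwidth_bps=bandwidth))  # edge wins
```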
{"title":"Dynamic Offloading for Improved Performance and Energy Efficiency in Heterogeneous IoT-Edge-Cloud Continuum","authors":"J. Vicenzi, Guilherme Korol, M. Jordan, Wagner Ourique de Morais, Hazem Ali, Edison Pignaton De Freitas, M. B. Rutzig, A. C. S. Beck","doi":"10.1109/ISVLSI59464.2023.10238564","DOIUrl":"https://doi.org/10.1109/ISVLSI59464.2023.10238564","url":null,"abstract":"While machine learning applications in IoT devices are getting more widespread, the computational and power limitations of these devices pose a great challenge. To handle this increasing computational burden, edge, and cloud solutions emerge as a means to offload computation to more powerful devices. However, the unstable nature of network connections constantly changes the communication costs, making the offload process (i.e., when and where to transfer data) a dynamic trade-off. In this work, we propose DECOS: a framework to automatically select at run-time the best offloading solution with minimum latency based on the computational capabilities of devices and network status at a given moment. We use heterogeneous devices for edge and Cloud nodes to evaluate the framework’s performance using MobileNetV1 CNN and network traffic data from a real-world 4G bandwidth dataset. DECOS effectively selects the best processing node to maintain the minimum possible latency, reducing it up to 29% compared to Cloud-exclusive processing while reducing the energy consumption by 1.9$times$ compared to IoT-exclusive execution.","PeriodicalId":199371,"journal":{"name":"2023 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121330880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Photonic Convolution Engine Based on Phase-Change Materials and Stochastic Computing
Pub Date: 2023-06-20 | DOI: 10.1109/ISVLSI59464.2023.10238608
Raphael Cardoso, Clément Zrounba, M.F. Abdalla, Paul Jiménez, Mauricio Gomes de Queiroz, Benoît Charbonnier, Fabio Pavanello, Ian O’Connor, S. L. Beux
The last wave of AI developments sparked a global surge in computing resources allocated to neural network models. Even though such models solve complex problems, their mathematical foundations are simple, with the multiply-accumulate (MAC) operation standing out as one of the most important. However, improvements in traditional CMOS technologies fail to match the ever-increasing performance requirements of AI applications; therefore, new technologies, as well as disruptive computing architectures, must be explored. In this paper, we propose a novel in-memory implementation of a MAC operator based on stochastic computing and optical phase-change memories (oPCMs), leveraging their proven non-volatility and multi-level capabilities to achieve convolution. We show that resorting to the stochastic computing paradigm allows one to exploit the dynamic mechanisms of oPCMs to naturally compute and store MAC results with less noise sensitivity. Under similar conditions, we demonstrate improvements of up to 64× and 10× in the applications that we evaluated.
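The stochastic computing paradigm the paper builds on can be illustrated independently of the optical device: values in [0, 1] become Bernoulli bitstreams, and a bitwise AND yields a stream whose ones-density approximates the product. A minimal NumPy sketch follows; the oPCM storage and its dynamics are not modeled.

```python
import numpy as np

rng = np.random.default_rng(42)

def sc_multiply(p1, p2, n_bits=4096):
    # stochastic-computing multiply: each value is encoded as a Bernoulli
    # bitstream whose ones-probability equals the value; ANDing two
    # independent streams gives a stream with ones-probability p1*p2
    s1 = rng.random(n_bits) < p1
    s2 = rng.random(n_bits) < p2
    return np.mean(s1 & s2)

print(sc_multiply(0.5, 0.8))  # ~0.4, with stochastic error ~1/sqrt(n_bits)
```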
{"title":"Photonic Convolution Engine Based on Phase-Change Materials and Stochastic Computing","authors":"Raphael Cardoso, Clément Zrounba, M.F. Abdalla, Paul Jiménez, Mauricio Gomes de Queiroz, Benoît Charbonnier, Fabio Pavanello, Ian O’Connor, S. L. Beux","doi":"10.1109/ISVLSI59464.2023.10238608","DOIUrl":"https://doi.org/10.1109/ISVLSI59464.2023.10238608","url":null,"abstract":"The last wave of AI developments sparked a global surge in computing resources allocated to neural network models. Even though such models solve complex problems, their mathematical foundations are simple, with the multiply-accumulate (MAC) operation standing out as one of the most important. However, improvements in traditional CMOS technologies fail to match the ever-increasing performance requirements of AI applications, therefore new technologies, as well as disruptive computing architectures must be explored. In this paper, we propose a novel in-memory implementation of a MAC operator based on stochastic computing and optical phase-change memories (oPCMs), leveraging their proven non-volatility and multi-level capabilities to achieve convolution. We show that resorting to the stochastic computing paradigm allows one to exploit the dynamic mechanisms of oPCMs to naturally compute and store MAC results with less noise sensitivity. Under similar conditions, we demonstrate an improvement of up to $64times$ and $10times$ in the applications that we evaluated.","PeriodicalId":199371,"journal":{"name":"2023 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130447024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Digital SRAM Computing-in-Memory Design Utilizing Activation Unstructured Sparsity for High-Efficient DNN Inference
Pub Date: 2023-06-20 | DOI: 10.1109/ISVLSI59464.2023.10238597
Baiqing Zhong, Mingyu Wang, Chuanghao Zhang, Yangzhan Mai, Xiaojie Li, Zhiyi Yu
The Computing-in-Memory (CIM) architecture has emerged as a promising approach for designing energy-efficient DNN processors. While previous CIM designs have explored the use of DNN weight sparsity, these approaches often involve pruning the weight matrix in a specific manner, which can add computational complexity and negatively impact DNN accuracy. Moreover, there are barely any digital CIM circuits that leverage activation sparsity, even though activations are naturally sparse in many scenarios due to ReLU activation functions. To fully utilize unstructured activation sparsity, we propose a digital SRAM CIM design. The circuit uses a Booth encoding scheme and adopts the structure of an accumulator-based multiply-accumulate (MAC) unit. It utilizes SRAM bit-line (BL) computing to obtain matrix sparsity information and employs an allocator to assign data computations within the SRAM-CIM. The proposed design is implemented and evaluated in a 40 nm CMOS process. Our evaluation results show that the proposed circuit achieves a clock frequency of 1 GHz at 1.1 V with a peak performance of 819.2 GOPS; at 50%-90% sparsity, the SRAM-CIM achieves a 1.12× to 3.32× speedup and energy savings of 48.2% to 90.57% over dense mode. When performing an 8-bit matrix multiplication with 90% sparsity, the energy efficiency is 10.57 TOPS/W.
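The gain from unstructured activation sparsity can be seen in a simple functional model: every zero activation is detected and skipped before any MAC work is spent on it. The sketch below is conceptual only; the paper's bit-line zero detection and Booth-encoded datapath are not modeled.

```python
def sparse_mac(activations, weights):
    # accumulator-based MAC that skips zero activations; in the paper the
    # zero check is done in hardware via SRAM bit-line computing, and an
    # allocator steers the remaining nonzero work
    acc, skipped = 0, 0
    for a, w in zip(activations, weights):
        if a == 0:        # unstructured sparsity: no pattern assumed
            skipped += 1  # this element costs no MAC energy/cycles
            continue
        acc += a * w
    return acc, skipped

# post-ReLU activation vectors are often mostly zero
print(sparse_mac([0, 3, 0, 0, 7, 0, 0, 1], [5, -2, 4, 1, 3, 9, -8, 2]))  # (17, 5)
```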
{"title":"A Digital SRAM Computing-in-Memory Design Utilizing Activation Unstructured Sparsity for High-Efficient DNN Inference","authors":"Baiqing Zhong, Mingyu Wang, Chuanghao Zhang, Yangzhan Mai, Xiaojie Li, Zhiyi Yu","doi":"10.1109/ISVLSI59464.2023.10238597","DOIUrl":"https://doi.org/10.1109/ISVLSI59464.2023.10238597","url":null,"abstract":"The Computing-in-Memory (CIM) architecture has emerged as a promising approach for designing energy-efficient DNN processors. While previous CIM designs have explored the use of DNN weight sparsity, these approaches often involve pruning the weight matrix in a specific manner. This process may increase the new complexity of the calculation and negatively impact DNN accuracy. However, there are barely any digital CIM circuits that leverage the sparsity in activation which is naturally sparse in many scenarios due to the ReLU activation functions. In order to fully utilize activation unstructured sparsity, we proposed a digital SRAM CIM. This circuit is designed using the booth encoding scheme and adopts the circuit structure of an accumulator-based multiply-accumulate (MAC) calculation. It utilizes SRAM bit-line (BL) computing to obtain matrix sparse information and employs an allocator to allocate data calculation for SRAM-CIM. The proposed design is implemented and evaluated at 40 nm CMOS process. Our evaluation results show that the proposed circuit can achieve a clock frequency of 1 GHz at 1.1 V, with a peak performance of 819.2 GOPS, and in the case of 50%-90% sparsity, SRAM-CIM achieves $1.12 times 3.32 times$ speedup, and energy savings of 48.2% to 90.57% over dense mode. When performing an 8-bit matrix multiplication with 90% sparsity, the energy efficiency is 10.57 TOPS/W.","PeriodicalId":199371,"journal":{"name":"2023 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"105 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134503681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating an XOR-based Hybrid Fault Tolerance Technique to Detect Faults in GPU Pipelines
Pub Date: 2023-06-20 | DOI: 10.1109/ISVLSI59464.2023.10238657
Giani Braga, Marcio M. Gonçalves, J. Azambuja
Graphics Processing Units are consistently reaching new application domains thanks to their massively parallel execution architectures. However, some safety-critical areas, such as avionics, expose them to harsh environments where radiation effects caused by cosmic rays can lead to component failures. This work implements and tests a hybrid fault tolerance technique, initially proposed by NVIDIA, to protect a GPU’s pipeline against radiation effects. Results show that the technique can be effective against data-flow errors, but at a high cost in execution-time overhead and with potentially increased control-flow errors.
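The general duplicate-and-compare idea behind XOR-based data-flow error detection can be sketched in software: execute redundantly and XOR the results, so any nonzero signature flags a transient fault. This is a conceptual model only, not NVIDIA's pipeline instrumentation.

```python
def xor_check(op, x, y):
    # run the operation twice and XOR the integer results; a nonzero
    # signature indicates the two copies disagreed (a data-flow fault)
    r1 = op(x, y)
    r2 = op(x, y)  # redundant execution (hardware: a shadow pipeline copy)
    if r1 ^ r2:
        raise RuntimeError("data-flow fault detected")
    return r1

print(xor_check(lambda a, b: a + b, 40, 2))  # 42 when both copies agree
```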
{"title":"Evaluating an XOR-based Hybrid Fault Tolerance Technique to Detect Faults in GPU Pipelines","authors":"Giani Braga, Marcio M. Gonçalves, J. Azambuja","doi":"10.1109/ISVLSI59464.2023.10238657","DOIUrl":"https://doi.org/10.1109/ISVLSI59464.2023.10238657","url":null,"abstract":"Graphics Processing Units are consistently reaching new applications due to their massive parallel execution architectures. However, some safety-critical areas, such as avionics, come with unfriendly environments due to radiation effects caused by cosmic rays, effectively causing component failures. This work implements and tests a hybrid fault tolerance technique initially proposed by NVIDIA to protect a GPU’s pipeline against radiation effects. Results show that the technique can be effective against data-flow errors but at a high cost in execution time overheads and potentially increased control-flow errors.","PeriodicalId":199371,"journal":{"name":"2023 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132248338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Design and Evaluation of M-Term Non-Homogeneous Hybrid Karatsuba Polynomial Multiplier
Pub Date: 2023-06-20 | DOI: 10.1109/ISVLSI59464.2023.10238681
Sanampudi Gopala Krishna Reddy, Gogireddy Ravi Kiran Reddy, Vasanthi D R, Madhav Rao
Finite-field multipliers play an increasingly crucial role in modern cryptography systems. While much attention has been given to the development of area-efficient Karatsuba multipliers as a means of bolstering encryption capabilities, a vast and untapped design space remains to be explored. An innovative technique that has emerged in this area is the Composite M-Term Karatsuba-like Multiplier, which integrates a schoolbook multiplier (SBM) at the lower bounds to enhance performance. However, breaking down operand bit-widths homogeneously across the recursion stages may not yield optimal hardware characteristics; further improvement can be achieved by configuring the recursive stages with non-homogeneous ‘M’ values. This paper performs an exhaustive design-space exploration of Karatsuba-like multipliers for different bit-widths and presents a methodology for designing the possible sequences of an M-term non-homogeneous hybrid Karatsuba multiplier (MNHKA). A few MNHKA designs among the many sequences achieve high performance while minimizing area requirements. This study evaluates the area, delay, and area-delay product (ADP) characteristics of the pure M-term Karatsuba multiplier (MKA), the composite M-term Karatsuba with SBM (CMKA), and the novel MNHKA, each configured as a finite-field multiplier for several popular bit-widths. In addition, the study introduces a MATLAB-based framework that generates optimized hardware design code for MNHKA designs with customizable sequences and operand sizes. The proposed MNHKA design was implemented and verified on a Zynq ZCU104 FPGA board and also synthesized with a 45 nm technology library using the Cadence Genus tool. The FPGA results, in terms of LUT utilization and delay, clearly indicate that the proposed MNHKA polynomial multiplier outperforms state-of-the-art (SOTA) designs across bit-widths. Specifically, the proposed design achieves an ADP improvement of 12.33% for a bit-width of 64, and greater gains of 21.15%, 27.74%, and 23.045% for higher-order bit-widths of 409, 1350, and 2500, respectively, compared to the CMKA (SOTA) multiplier. The ASIC flow results show an impressive maximum footprint saving of 47.61% and a significant ADP gain of 45.72% for the 1350-bit design, along with ADP improvements of 16.42%, 15.56%, and 22.59% for bit-widths of 64, 409, and 2500, respectively, compared to the CMKA design. All the designs are made freely available for adoption by the research and design community.
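For reference, the standard 2-term (M=2) recursive Karatsuba over GF(2)[x] with a schoolbook base case can be sketched in a few lines of Python; the paper's M-term non-homogeneous splitting generalizes this recursion by varying the number of terms per stage. The threshold below is an illustrative assumption, not one of the paper's chosen sequences.

```python
def clmul(a, b):
    # schoolbook carry-less multiply over GF(2)[x] (the SBM base case);
    # polynomials are encoded as Python ints and addition is XOR
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def karatsuba_gf2(a, b, n, threshold=8):
    # classic 2-term Karatsuba recursion on n-bit operands; MNHKA would
    # instead pick a (possibly different) M at every recursion stage
    if n <= threshold:
        return clmul(a, b)
    h = n // 2
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h
    b0, b1 = b & mask, b >> h
    p0 = karatsuba_gf2(a0, b0, h, threshold)
    p2 = karatsuba_gf2(a1, b1, n - h, threshold)
    p1 = karatsuba_gf2(a0 ^ a1, b0 ^ b1, n - h, threshold)
    # over GF(2), subtraction equals XOR: middle term = p1 - p0 - p2
    return (p2 << (2 * h)) ^ ((p0 ^ p1 ^ p2) << h) ^ p0

# sanity check against the schoolbook multiplier
assert karatsuba_gf2(0b10011011, 0b11010101, 8, threshold=2) == clmul(0b10011011, 0b11010101)
```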
{"title":"Design and Evaluation of M-Term Non-Homogeneous Hybrid Karatsuba Polynomial Multiplier","authors":"Sanampudi Gopala Krishna Reddy, Gogireddy Ravi Kiran Reddy, Vasanthi D R, Madhav Rao","doi":"10.1109/ISVLSI59464.2023.10238681","DOIUrl":"https://doi.org/10.1109/ISVLSI59464.2023.10238681","url":null,"abstract":"Finite-field multipliers progressively plays a crucial role in modern cryptography systems. While much attention has been given to the development of area-efficient Karatsuba multipliers as a means of bolstering encryption capabilities, there remains a vast and untapped realm of design space yet to be explored. An innovative technique that has emerged in this area involves the implementation of a Composite M-Term Karatsuba-like Multiplier, which integrates a schoolbook multiplier (SBM) at lower bounds to enhance performance. However, the approach of breaking down operand bit-widths homogeneously along the stages may not result in optimal hardware characteristics, and further improvement can be achieved by configuring the recursive stages to non-homogeneous ‘M’ values. This paper attempts to perform an exhaustive design-space exploration of Karatsuba-like multipliers for different bit-widths and presents a methodology for designing different possible sequences for M-Term non-homogeneous hybrid Karatsuba multiplier (MNHKA). Few MNHKA designs among several sequences achieve high performance while minimizing area requirements. This study evaluates the area, delay, and area-delay-product (ADP) characteristics of pure M-Term Karatsuba multiplier (MKA), Composite M-Term Karatsuba with SBM (CMKA), and a novel MNHKA that are configured as finite field multipliers for different popular bitwidths. In addition, this study also introduces a novel Matlab-based framework that enables the generation of an optimized hardware design code for MNHKA design with customizable sequence and operand sizes. The proposed MNHKA design was implemented and verified on ZYNQ ZCU-104 FPGA Board and also synthesized using 45 nm technology library on Cadence-Genus tool. The implemented FPGA results with LUTs utilization and delay metrics clearly indicate that the proposed category of MNHKA polynomial multiplier outperforms SOTA designs for various bit-widths. Specifically, the proposed design achieves an ADP improvement of 12.33% for a bit-width of 64, and greater gains of 21.15%, 27.74%, and 23.045% for higher order bits of 409, 1350, and 2500, respectively, when compared to CMKA(STOA) multiplier. The experimental results of ASIC flow resulted in an impressive maximum footprint saving of 47.61% as well as significant ADP gains of 45.72% for the 1350-bit design, and also achieved ADP improvement of 16.42%, 15.56%, and 22.59% for bit widths of 64, 409, and 2500, respectively, when compared to CMKA design. 
All the designs are made freely available for further adoption to the researchers and the designers community.","PeriodicalId":199371,"journal":{"name":"2023 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122219085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Machine Learning Techniques for Pre-CTS Identification of Timing Critical Flip-Flops
Pub Date: 2023-06-20 | DOI: 10.1109/ISVLSI59464.2023.10238658
Chunkai Fu, Ben Trombley, Hua Xiang, Gi-Joon Nam, Jiang Hu
The timing criticality of flip-flops is a key factor in combinational-circuit timing optimization and clock-network power reduction, both of which are often performed prior to clock tree synthesis (CTS) and routing. However, timing criticality is often changed by CTS and routing, so optimizations guided by pre-CTS criticality may deviate from the correct directions. This work investigates machine learning techniques for pre-CTS identification of post-routing timing-critical flip-flops. Experimental results show that ML-based early identification can achieve 99.7% accuracy and an area of 0.98 under the ROC (Receiver Operating Characteristic) curve, and is on average 62,000× to 73,000× faster than estimation via the full CTS and routing flow. Our method is also almost 8× faster than a state-of-the-art approach to ML-based timing prediction.
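The general workflow (train a classifier on pre-CTS flip-flop features against post-routing criticality labels, score it with ROC AUC) can be sketched with an off-the-shelf model. The features below are synthetic placeholders; the paper's actual feature set and model are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic stand-in for pre-CTS features (e.g., slack, fanout, distance
# to the clock root) and binary post-routing criticality labels
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 8))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=5000)) > 1.0

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
# area under the ROC curve, the same quality metric reported in the paper
print("ROC AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```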
{"title":"Machine Learning Techniques for Pre-CTS Identification of Timing Critical Flip-Flops","authors":"Chunkai Fu, Ben Trombley, Hua Xiang, Gi-Joon Nam, Jiang Hu","doi":"10.1109/ISVLSI59464.2023.10238658","DOIUrl":"https://doi.org/10.1109/ISVLSI59464.2023.10238658","url":null,"abstract":"-The timing criticality of flip-flops is a key factor for combinational circuit timing optimization and clock network power reduction, both of which are often performed prior to CTS (Clock Tree Synthesis) and routing. However, timing criticality is often changed by CTS/routing and therefore optimizations according to pre-CTS criticality may deviate from the correct directions. This work investigates machine learning techniques for pre-CTS identification of post-routing timing critical flip-flops. Experimental results show that the ML-based early identification can achieve 99.7% accuracy and 0.98 area under ROC (Receiver Operating Characteristic) curve, and is $62000 times$ to $73000 times$ faster than the estimate with CTS and routing flow on average. Our method is almost $8 times$ faster than a state-of-the-art approach of ML-based timing prediction.","PeriodicalId":199371,"journal":{"name":"2023 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122131705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LEX - A Cell Switching Arcs Extractor: A Simple SPICE-Input Interface for Electrical Characterization
Pub Date: 2023-06-20 | DOI: 10.1109/ISVLSI59464.2023.10238671
Rodrigo N. Wuerdig, V. H. Maciel, Ricardo Reis, S. Bampi
The characterization of logic cells is a critical step in the design of digital circuits. Existing open-source cell-characterization tools typically require significant extra information beyond the SPICE netlist. In this paper, we present a new open-source tool, LEX, that serves as a useful front-end for these characterization tools, extracting essential input and output information, Boolean expressions, truth tables, and transition (switching) arcs directly from the SPICE netlist. LEX offers several advantages over existing open-source methods. First, it simplifies the cell electrical-characterization process by eliminating the need for manual input of additional information, which saves time and reduces errors. Second, it provides a more comprehensive set of information than existing tools, including Boolean expressions and truth tables. Third, LEX is highly flexible and can be integrated with a wide range of existing open-source cell-characterization tools. We conducted experiments on a test set of netlists to demonstrate LEX’s effectiveness. By providing a more comprehensive set of information, eliminating manual input, and improving efficiency, our tool offers a powerful new option for integration into existing and future open-source characterization tools.
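The notion of a switching (transition) arc can be illustrated from a truth table alone: an arc is a single-input toggle that flips the cell output. The sketch below assumes the cell's Boolean function is already known; LEX's actual contribution is recovering that function from the SPICE netlist, which is not modeled here.

```python
from itertools import product

def truth_table_and_arcs(fn, n_inputs):
    # enumerate the truth table of an n-input cell, then list every
    # (input index, starting vector, output direction) whose single-input
    # toggle changes the output -- i.e., the cell's switching arcs
    table = {v: fn(*v) for v in product((0, 1), repeat=n_inputs)}
    arcs = []
    for v, out in table.items():
        for i in range(n_inputs):
            w = list(v)
            w[i] ^= 1
            if table[tuple(w)] != out:
                arcs.append((i, v, "rise" if out == 0 else "fall"))
    return table, arcs

# example: a 2-input NAND cell
table, arcs = truth_table_and_arcs(lambda a, b: 1 - (a & b), 2)
print(arcs)
```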
{"title":"LEX - A Cell Switching Arcs Extractor: A Simple SPICE-Input Interface for Electrical Characterization","authors":"Rodrigo N. Wuerdig, V. H. Maciel, Ricardo Reis, S. Bampi","doi":"10.1109/ISVLSI59464.2023.10238671","DOIUrl":"https://doi.org/10.1109/ISVLSI59464.2023.10238671","url":null,"abstract":"The characterization of logic cells is a critical step in the design of digital circuits. Existing open-source cell characterization tools typically require significant extra information beyond the SPICE netlist. In this paper, we present a new open-source tool - LEX - that serves as a very useful interface for these characterization tools, enabling the extraction of essential input and output information, Boolean expressions, truth tables, and transition (switching) arcs directly from the SPICE netlist. Our LEX tool offers several advantages over existing open-source methods. First, it simplifies the cell electrical characterization process by eliminating the need for manual input of additional information. This saves time and reduces the incidence of errors. Second, our tool provides a more comprehensive set of information than existing tools, including Boolean expressions and truth tables. Third, LEX is highly flexible and can be integrated with a wide range of existing open-source cell characterization tools. We conducted experiments using a test set of netlists to demonstrate LEX effectiveness. By providing a more comprehensive set of information, eliminating the need for manual input of additional information, and improving efficiency, our tool offers a powerful new option to be integrated into already existing and future open-source characterization tools.","PeriodicalId":199371,"journal":{"name":"2023 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"354 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126686826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploiting Routing Asymmetry for APUF Implementation in FPGA: A Proof-of-Concept
Pub Date: 2023-06-20 | DOI: 10.1109/ISVLSI59464.2023.10238578
Trishna Rajkumar
Implementing an arbiter PUF (APUF) in an FPGA requires identical logic and symmetrical routing to ensure that delay differences are due to process variations. As FPGA routing tools optimise for performance rather than symmetry, the FPGA CAD flow requires interventions such as manual routing and the use of hard macros. These measures require a designer to work at a lower level of abstraction than RTL, which can be tedious and error-prone. Furthermore, they require extensive knowledge of the FPGA fabric, which may not be available owing to its proprietary nature. Considering these challenges, we investigate the possibility of an arbiter PUF implementation within the standard FPGA CAD flow by leveraging the routing asymmetry instead of eliminating it. Preliminary characterisation of a proof-of-concept APUF model demonstrated a uniformity of 49.4% and a reliability of 96.3%.
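The uniformity figure can be related to the standard additive delay model of an arbiter PUF, in which the response is the sign of a per-stage delay difference accumulated along the challenge-selected paths. The Gaussian stage delays below are an assumption for illustration, not measured FPGA data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_stages, n_chal = 64, 10000
# per-stage delay differences for the straight and crossed switch settings,
# drawn from a Gaussian to model process variation (assumed parameters)
d_straight = rng.normal(0.0, 1.0, n_stages)
d_crossed = rng.normal(0.0, 1.0, n_stages)
C = rng.integers(0, 2, (n_chal, n_stages))  # random challenges

delta = np.zeros(n_chal)  # accumulated top-vs-bottom delay difference
sign = np.ones(n_chal)    # flips whenever a crossed stage swaps the paths
for i in range(n_stages):
    d = np.where(C[:, i] == 0, d_straight[i], d_crossed[i])
    delta += sign * d
    sign *= np.where(C[:, i] == 0, 1.0, -1.0)

resp = (delta > 0).astype(int)       # arbiter decides which path won
print("uniformity:", resp.mean())    # ~0.5 for an ideal APUF
```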
{"title":"Exploiting Routing Asymmetry for APUF Implementation in FPGA: A Proof-of-Concept","authors":"Trishna Rajkumar","doi":"10.1109/ISVLSI59464.2023.10238578","DOIUrl":"https://doi.org/10.1109/ISVLSI59464.2023.10238578","url":null,"abstract":"Implementing Arbiter PUF in an FPGA requires identical logic and symmetrical routing to ensure the delay differences are due to process variations. As the FPGA routing tools optimise for performance and not for symmetry, the FPGA CAD flow requires interventions like manual routing and the use of hard macros. These measures require a designer to work at a lower level of abstraction than RTL which can be tedious and error prone. Furthermore, they require an extensive knowledge of the FPGA fabric which may not be available owing to their proprietary nature. Considering these challenges, we investigate the possibility of an arbiter PUF implementation within the FPGA CAD flow by leveraging the routing asymmetry instead of eliminating it. Preliminary characterisation of a proof of concept APUF model demonstrated uniformity of 49.4 % and reliability of 96.3 %.","PeriodicalId":199371,"journal":{"name":"2023 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131536082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}