
2023 24th International Symposium on Quality Electronic Design (ISQED): Latest Publications

RECO-LFSR: Reconfigurable Low-power Cryptographic processor based on LFSR for Trusted IoT platforms
Pub Date : 2023-04-05 DOI: 10.1109/ISQED57927.2023.10129323
Mohamed El-Hadedy, Russell Hua, Kazutomo Yoshii, Wen-mei W. Hwu, M. Margala
Today we see lightweight computer hardware utilized in large volumes, especially with the growing use of IoT devices in homes. However, such devices often ignore security until it is too late and sensitive data breaches have occurred. Consequently, the importance of finding lightweight cryptographic primitives to secure IoT devices is increasing rapidly, provided those primitives do not strain the devices' limited resources or battery lifetime. In the search for a lightweight cryptographic standard, one must consider how to implement such algorithms optimally; for example, certain parts of an algorithm might be faster in hardware than in software, and vice versa. This paper presents a hardware extension to the MicroBlaze softcore processor that efficiently implements one of the Lightweight Cryptography (LWC) finalists, TinyJAMBU, on the Digilent Nexys A7-100T. The proposed hardware extension consists of a reconfigurable Non-Linear Feedback Shift Register (NLFSR), the central computing part of the authenticated encryption with associated data (AEAD) scheme TinyJAMBU. The proposed NLFSR can run different variants of TinyJAMBU while consuming only 186 mWh over ten minutes at 100 MHz. The total resources needed to host the proposed NLFSR on the FPGA are 610 LUTs and 505 flip-flops, while the executable binary is 352 bytes smaller. The proposed solution based on the hardware extension is 2.17 times faster than the pure software implementation of the whole TinyJAMBU on MicroBlaze while consuming six mWh more. To our knowledge, this is the first implementation of TinyJAMBU using software/hardware partitioning on an FPGA with the MicroBlaze softcore processor.
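As an illustration of the kind of kernel the hardware extension offloads, the following C sketch models one step of a TinyJAMBU-style 128-bit NLFSR in software. The tap positions (0, 47, 70, 85, 91, with a NAND on bits 70 and 85) follow the published TinyJAMBU specification; the bit-array representation, the all-zero demo key, and the round count are illustrative choices, not the authors' RTL or software baseline.

```c
/* Minimal bit-level sketch of a TinyJAMBU-style 128-bit NLFSR update.
 * Illustrative software model only; not the paper's hardware extension. */
#include <stdint.h>
#include <stdio.h>

#define STATE_BITS 128

/* One NLFSR step: compute the feedback bit, shift, insert at position 127. */
static void nlfsr_step(uint8_t s[STATE_BITS], uint8_t key_bit)
{
    uint8_t feedback = s[0] ^ s[47]
                     ^ (uint8_t)(1u ^ (s[70] & s[85]))   /* NAND of two taps */
                     ^ s[91] ^ key_bit;
    for (int j = 0; j < STATE_BITS - 1; j++)
        s[j] = s[j + 1];                 /* shift the register by one bit   */
    s[STATE_BITS - 1] = feedback & 1u;   /* new bit enters at the MSB end   */
}

int main(void)
{
    uint8_t state[STATE_BITS] = {0};
    uint8_t key[128] = {0};              /* all-zero key, demo only         */
    state[0] = 1;                        /* arbitrary non-zero seed         */

    for (int i = 0; i < 1024; i++)       /* run 1024 rounds of the update   */
        nlfsr_step(state, key[i % 128]);

    for (int j = 0; j < 8; j++)          /* print the first eight state bits */
        printf("%d", state[j]);
    printf("\n");
    return 0;
}
```

In the paper's design, this shift-and-feedback update is what moves from MicroBlaze software into the reconfigurable NLFSR hardware.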
Citations: 0
Analysis of Pattern-dependent Rapid Thermal Annealing Effects on SRAM Design
Pub Date : 2023-04-05 DOI: 10.1109/ISQED57927.2023.10129399
Vidya A. Chhabria, S. Sapatnekar
Rapid thermal annealing (RTA) is an important step in semiconductor manufacturing. RTA-induced variability due to differences in die layout patterns can significantly contribute to transistor parameter variations, resulting in degraded chip performance and yield. The die layout patterns that drive these variations are related to the distribution of the density of transistors (silicon) and shallow trench isolation (silicon dioxide) across the die, which result in emissivity variations that change the die surface temperature during annealing. While prior art has developed pattern-dependent simulators and provided mitigation techniques for digital design, it has failed to consider the impact of the temperature-dependent thermal conductivity of silicon on RTA effects and has not analyzed the effects on memory. This work develops a novel 3D transient pattern-dependent RTA simulation methodology that accounts for the dependence of the thermal conductivity of silicon on temperature. The simulator is used to both analyze the effects of RTA on memory performance and to propose mitigation strategies for a 7nm FinFET SRAM design. It is shown that RTA effects degrade read and write delays by 16% and 20% and read static noise margin (SNM) by 15%, and the applied mitigation strategies can compensate for these degradations at the cost of a 16% increase in area for a 7.5% tolerance in SNM margin.
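To make the coupling between temperature and conductivity concrete, here is a toy 1-D explicit finite-difference transient conduction model in C with a temperature-dependent silicon conductivity k(T). The power-law fit for k(T), the material constants, the geometry, and the boundary conditions are rough illustrative assumptions; the paper's simulator is a calibrated 3-D transient, pattern-dependent model and is not reproduced here.

```c
/* Toy 1-D explicit transient conduction with temperature-dependent k(T).
 * All numbers are illustrative assumptions, not the paper's calibration. */
#include <stdio.h>
#include <math.h>

#define N   100          /* grid cells across a 1 mm slab                 */
#define DX  1.0e-5       /* cell size [m]                                 */
#define DT  1.0e-7       /* time step [s], chosen for explicit stability  */

static double k_si(double T)             /* W/(m*K), approximate fit      */
{
    return 148.0 * pow(300.0 / T, 1.3);  /* conductivity drops as T rises */
}

int main(void)
{
    const double rho_c = 2329.0 * 700.0;  /* density * heat capacity       */
    double T[N], Tn[N];

    for (int i = 0; i < N; i++) T[i] = 300.0;
    T[0] = 1300.0;                        /* lamp-heated surface boundary  */

    for (int step = 0; step < 100000; step++) {   /* 0.01 s of anneal      */
        for (int i = 1; i < N - 1; i++) {
            double kl = 0.5 * (k_si(T[i - 1]) + k_si(T[i]));
            double kr = 0.5 * (k_si(T[i]) + k_si(T[i + 1]));
            double flux = kr * (T[i + 1] - T[i]) - kl * (T[i] - T[i - 1]);
            Tn[i] = T[i] + DT * flux / (rho_c * DX * DX);
        }
        Tn[0] = 1300.0;                   /* fixed hot boundary            */
        Tn[N - 1] = Tn[N - 2];            /* adiabatic back side           */
        for (int i = 0; i < N; i++) T[i] = Tn[i];
    }
    printf("mid-slab temperature after 0.01 s: %.1f K\n", T[N / 2]);
    return 0;
}
```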
Citations: 0
Heterogeneous Multi-Functional Look-Up-Table-based Processing-in-Memory Architecture for Deep Learning Acceleration
Pub Date : 2023-04-05 DOI: 10.1109/ISQED57927.2023.10129338
Sathwika Bavikadi, Purab Ranjan Sutradhar, A. Ganguly, Sai Manoj Pudukotai Dinakarrao
Emerging applications including deep neural networks (DNNs) and convolutional neural networks (CNNs) employ massive amounts of data to perform computations and data analysis. Such applications often lead to resource constraints and impose large overheads in data movement between memory and compute units. Several architectures such as Processing-in-Memory (PIM) have been introduced to alleviate the bandwidth bottlenecks and inefficiency of traditional computing architectures. However, the existing PIM architectures represent a trade-off between power, performance, area, energy efficiency, and programmability. To better achieve the energy-efficiency and flexibility criteria simultaneously in hardware accelerators, we introduce a multi-functional look-up-table (LUT)-based reconfigurable PIM architecture in this work. The proposed architecture is a many-core architecture; each core comprises processing elements (PEs), stand-alone processors with programmable functional units built using high-speed reconfigurable LUTs. The proposed LUTs can perform the various operations required for CNN acceleration, including convolution, pooling, and activation. Additionally, the proposed LUTs can provide multiple outputs for different functionalities simultaneously without the need to design a separate LUT per functionality, which optimizes area and power overheads. Furthermore, we also design special-function LUTs, which can provide simultaneous outputs for multiplication and accumulation as well as special activation functions such as hyperbolic functions and sigmoids. We have evaluated various CNNs such as LeNet, AlexNet, and ResNet-18/34/50. Our experimental results demonstrate that AlexNet implemented on the proposed architecture achieves up to 200× higher energy efficiency and 1.5× higher throughput than a DRAM-based LUT-based PIM architecture.
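A software analogy may help clarify the LUT-based compute idea: the C sketch below programs a 256-entry table with all products of two 4-bit operands and then performs a multiply-accumulate loop purely through table lookups, mirroring how a reconfigurable LUT in a PIM core can be reprogrammed per operation. The operand precision, table layout, and function names are assumptions for illustration, not the paper's architecture.

```c
/* Software analogy of LUT-based compute: a 256-entry table replaces the
 * 4-bit multiplier in a MAC loop.  Illustrative only. */
#include <stdint.h>
#include <stdio.h>

static uint8_t mul_lut[256];               /* products of two 4-bit operands */

static void program_lut(void)              /* "reconfigure" the LUT          */
{
    for (int a = 0; a < 16; a++)
        for (int b = 0; b < 16; b++)
            mul_lut[(a << 4) | b] = (uint8_t)(a * b);
}

static uint32_t lut_mac(const uint8_t *x, const uint8_t *w, int n)
{
    uint32_t acc = 0;
    for (int i = 0; i < n; i++)            /* each MAC is one table lookup   */
        acc += mul_lut[((x[i] & 0xF) << 4) | (w[i] & 0xF)];
    return acc;
}

int main(void)
{
    uint8_t x[4] = {1, 2, 3, 4}, w[4] = {5, 6, 7, 8};
    program_lut();
    printf("dot product = %u\n", lut_mac(x, w, 4));   /* expect 70 */
    return 0;
}
```

Re-filling the table (for example with a quantized activation function instead of products) is the software counterpart of the architecture's per-operation reconfigurability.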
Citations: 0
SQRTLIB: Library of Hardware Square Root Designs
Pub Date : 2023-04-05 DOI: 10.1109/ISQED57927.2023.10129377
C. PrashanthH., S. SrinikethS., Shrikrishna Hebbar, R. Chinmaye, M. Rao
Square root is an elementary arithmetic function that is used not only in image and signal processing applications but also to extract vector functionalities. The square-root module demands significant energy and hardware resources, apart from being a complex design to implement. In the past, many techniques, including Iterative, New Non-Restoring (New-NR), CORDIC, Piece-wise-linear (PWL) approximation, Look-Up-Tables (LUTs), and digit-by-digit implementations in integer (Digit-Int) and fixed-point (Digit-FP) formats, have been reported to realize the square-root function. Cartesian genetic programming (CGP) is a type of evolutionary algorithm that evolves circuits by exploring a large solution space. This paper develops a library of square-root circuits ranging from 2 bits to 8 bits and benchmarks the proposed CGP-evolved square-root circuits against the other hardware implementations. All designs were analyzed using both FPGA and ASIC (130 nm SkyWater node) flows to characterize hardware parameters and were evaluated using various error metrics. Among all the implementations, CGP-derived square-root designs in fixed-point format offered the best trade-off between hardware and error characteristics. All novel designs of this work are made freely available in [1] for further research and development usage.
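For reference, the classical digit-by-digit binary integer square root, one of the algorithm families benchmarked against the CGP-evolved designs, can be written compactly in C as below. This is the textbook routine, not one of the paper's generated circuits.

```c
/* Classical digit-by-digit binary integer square root (software reference). */
#include <stdint.h>
#include <stdio.h>

static uint32_t isqrt(uint32_t n)
{
    uint32_t res = 0;
    uint32_t bit = 1u << 30;              /* highest power of four <= 2^30   */

    while (bit > n) bit >>= 2;            /* align with the operand          */
    while (bit != 0) {                    /* resolve one result bit per step */
        if (n >= res + bit) {
            n  -= res + bit;
            res = (res >> 1) + bit;
        } else {
            res >>= 1;
        }
        bit >>= 2;
    }
    return res;                           /* floor(sqrt(n))                  */
}

int main(void)
{
    for (uint32_t v = 0; v < 9; v++)      /* 0..8 -> 0 1 1 1 2 2 2 2 2       */
        printf("isqrt(%u) = %u\n", v, isqrt(v));
    return 0;
}
```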
Citations: 0
Polynomial Formal Verification of a Processor: A RISC-V Case Study
Pub Date : 2023-04-05 DOI: 10.1109/ISQED57927.2023.10129397
Lennart Weingarten, Alireza Mahzoon, Mehran Goli, R. Drechsler
Formal verification is an important task to ensure the correctness of a circuit. In the last 30 years, several formal methods have been proposed to verify various architectures. However, the space and time complexities of these methods are usually unknown, particularly, when it comes to complex designs, e.g., processors. As a result, there is always unpredictability in the performance of the verification tool. If we prove that a formal method has polynomial space and time complexities, we can successfully resolve the unpredictability problem and ensure the scalability of the method.In this paper, we propose a Polynomial Formal Verification (PFV) method based on Binary Decision Diagrams (BDDs) to fully verify a RISC-V processor. We take advantage of partial simulation to extract the hardware related to each instruction. Then, we create the reference BDD for each instruction with respect to its size and function. Finally, we run a symbolic simulation for each hardware instruction and compare it with the reference model. We prove that the whole verification task can be carried out in polynomial space and time. The experiments demonstrate that the PFV of a RISC-V RV32I processor can be performed in less than one second.
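As a toy stand-in for the verification flow, the C snippet below checks a gate-level ripple-carry model of a 4-bit ADD slice against its arithmetic reference by exhaustive enumeration. The paper instead builds a reference BDD per instruction and compares it against a symbolic simulation of the hardware extracted for that instruction, which is what keeps space and time polynomial; brute-force enumeration, shown here only for intuition, does not scale.

```c
/* Toy per-instruction equivalence check: gate-level 4-bit ADD vs reference. */
#include <stdint.h>
#include <stdio.h>

static uint8_t ripple_add4(uint8_t a, uint8_t b)   /* gate-level model      */
{
    uint8_t sum = 0, carry = 0;
    for (int i = 0; i < 4; i++) {
        uint8_t ai = (a >> i) & 1, bi = (b >> i) & 1;
        uint8_t s  = ai ^ bi ^ carry;              /* full-adder sum bit     */
        carry      = (ai & bi) | (carry & (ai ^ bi));
        sum       |= (uint8_t)(s << i);
    }
    return sum & 0xF;
}

int main(void)
{
    for (int a = 0; a < 16; a++)
        for (int b = 0; b < 16; b++)
            if (ripple_add4((uint8_t)a, (uint8_t)b) != ((a + b) & 0xF)) {
                printf("mismatch at a=%d b=%d\n", a, b);
                return 1;
            }
    printf("ADD slice matches the reference on all 256 input patterns\n");
    return 0;
}
```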
Citations: 0
Lightweight Instruction Set for Flexible Dilated Convolutions and Mixed-Precision Operands
Pub Date : 2023-04-05 DOI: 10.1109/ISQED57927.2023.10129341
Simon Friedrich, Shambhavi Balamuthu Sampath, R. Wittig, M. Vemparala, Nael Fasfous, E. Matús, W. Stechele, G. Fettweis
Modern deep neural networks specialized for object detection and semantic segmentation require specific operations to increase or preserve the resolution of their feature maps. Hence, more generic convolution layers called transposed and dilated convolutions are employed, which add a large number of zeros between the elements of the input features or weights. Usually, standard neural network hardware accelerators process these convolutions in a straightforward manner, without paying attention to the added zeros, resulting in an increased computation time. To cope with this problem, recent works propose to skip the redundant elements with additional hardware or solve the problem efficiently only for a limited range of dilation rates. We present a general approach for accelerating transposed and dilated convolutions that does not introduce any hardware overhead while supporting all dilation rates. To achieve this, we introduce a novel precision-scalable lightweight instruction set and memory scheme that can be applied to the different convolution variants. This results in a speed-up of 5 times for DeepLabV3+, outperforming recently proposed design methods. The support of precision-scalable execution of all workloads further increases the speedup in computation time shown for the PointPillars, DeepLabV3+, and ENet networks. Compared to the state-of-the-art commercial EdgeTPU, the ResNet-50 instruction footprint of our designed accelerator is reduced by 60 percent.
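The zero-skipping idea can be illustrated with a short C routine: rather than inserting zeros into the kernel (or input) and multiplying through them, the input is indexed with a stride equal to the dilation rate, so no cycle is spent on a padded zero. The 1-D formulation, shapes, and function name are illustrative; the paper targets 2-D CNN layers through its precision-scalable instruction set.

```c
/* Zero-skipping dilated convolution: y[i] = sum_k x[i + k*dilation] * w[k]. */
#include <stdio.h>

static void dilated_conv1d(const float *x, int xlen,
                           const float *w, int klen,
                           int dilation, float *y)
{
    int span = (klen - 1) * dilation + 1;          /* effective kernel span   */
    for (int i = 0; i + span <= xlen; i++) {
        float acc = 0.0f;
        for (int k = 0; k < klen; k++)
            acc += x[i + k * dilation] * w[k];     /* skip the implicit zeros */
        y[i] = acc;
    }
}

int main(void)
{
    float x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float w[3] = {1, 1, 1};
    float y[8] = {0};
    dilated_conv1d(x, 8, w, 3, 2, y);              /* dilation rate 2         */
    for (int i = 0; i < 4; i++) printf("%.0f ", y[i]);   /* 9 12 15 18        */
    printf("\n");
    return 0;
}
```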
Citations: 2
Power Savings in USB Hubs Through A Proactive Scheduling Strategy
Pub Date : 2023-04-05 DOI: 10.1109/ISQED57927.2023.10129309
Bikrant Das Sharma, Abdul Rahman Ismail, Chris Meyers
USB has been the dominant external I/O in computing systems over the past two decades. With the increased adoption of USB-C with high data rates, USB hubs are becoming more popular. Existing power-saving mechanisms do not save much power in USB hubs when there is a steady bandwidth demand from devices. In this paper, we demonstrate significant power savings with a proactive scheduling policy for hubs. Our approach includes the introduction of a shallow U1/CL1 low-power state, resulting in better overall power savings due to the reduced entry and exit times to U1/CL1. Our results demonstrate power savings of tens of watts by increasing the scheduling interval up to the minimum latency tolerance across all devices connected to that hub. As USB moves to USB4 and hubs are used to connect to higher bandwidth devices, these power savings will become even more pronounced.
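A minimal sketch of the proactive policy follows: the hub widens its service interval up to the smallest latency tolerance among the attached devices, letting links rest in the shallow low-power state between service bursts. The device list and tolerance values are hypothetical and only illustrate the min-over-devices rule.

```c
/* Proactive scheduling sketch: service interval bounded by the tightest
 * latency tolerance among attached devices.  Numbers are hypothetical. */
#include <stdio.h>

struct usb_device { const char *name; unsigned latency_tolerance_us; };

static unsigned pick_service_interval(const struct usb_device *devs, int n)
{
    unsigned interval = ~0u;
    for (int i = 0; i < n; i++)              /* take the minimum tolerance   */
        if (devs[i].latency_tolerance_us < interval)
            interval = devs[i].latency_tolerance_us;
    return interval;
}

int main(void)
{
    struct usb_device devs[] = {
        {"webcam", 125}, {"ssd", 500}, {"keyboard", 1000},
    };
    printf("hub service interval: %u us\n",
           pick_service_interval(devs, 3));  /* 125 us in this example       */
    return 0;
}
```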
Citations: 0
Enlarging Reliable Pairs via Inter-Distance Offset for a PUF Entropy-Boosting Algorithm
Pub Date : 2023-04-05 DOI: 10.1109/ISQED57927.2023.10129308
Md. Omar Faruque, Wenjie Che
Physically Unclonable Functions (PUFs) are emerging hardware security primitives that leverage random variations during chip fabrication to generate unique secrets. The amount of random secrets that can be extracted from a limited number of physical PUF components can be measured in entropy bits. Existing strategies of pairing or grouping N RO-PUF elements have an entropy upper bound limited by log2(N!) or O(N·log2(N)). A recently proposed entropy-boosting technique [9] increases the entropy bits to a quadratically large N(N-1)/2 or O(N^2), significantly improving the RO-PUF hardware utilization efficiency in generating secrets. However, the increased amount of random secrets comes at the cost of discarding a large portion of unreliable bits. In this paper, we propose an "Inter-Distance Offset (IDO)" technique that converts those unreliable pairs into reliable ones by adjusting the pair inter-distance to an appropriate range. A theoretical analysis of the ratio of converted unreliable bits is provided along with experimental validation. Experimental evaluations of reliability and of the entropy/reliability tradeoff are given using the real RO PUF datasets in [10]. Information leakage is analyzed and evaluated using PUF datasets to identify those offset ranges that leak no information. The proposed technique improves the portion of reliable (quadratically large) entropy bits by 20% and 100%, respectively, for different offset ranges. Hardware implementation on FPGAs demonstrates that the proposed technique is lightweight in implementation and runtime.
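The following C sketch illustrates one plausible reading of the IDO idea (the exact enrollment rule here is ours, not the paper's): each RO pair yields a response bit from the sign of its frequency difference, and pairs whose difference falls inside a noisy band around zero are assigned a stored offset at enrollment that pushes the regenerated difference past a reliability threshold. The frequencies and the threshold are made-up numbers for illustration.

```c
/* Illustrative sketch of pairwise RO-PUF bits with an inter-distance offset. */
#include <stdio.h>
#include <stdlib.h>

#define THRESH 5            /* |difference| below this is considered noisy   */

static int pair_bit(int fi, int fj, int offset)
{
    return (fi - fj + offset) > 0;          /* response bit from the sign    */
}

static int enroll_offset(int fi, int fj)
{
    int d = fi - fj;
    if (abs(d) >= THRESH) return 0;         /* already reliable, no offset   */
    return d >= 0 ? (THRESH - d) : -(THRESH + d);   /* push past the band    */
}

int main(void)
{
    int f[4] = {1003, 1001, 1010, 997};     /* hypothetical RO frequencies   */
    for (int i = 0; i < 4; i++)             /* N(N-1)/2 = 6 pairs for N = 4  */
        for (int j = i + 1; j < 4; j++) {
            int off = enroll_offset(f[i], f[j]);
            printf("pair(%d,%d): offset %+d, bit %d\n",
                   i, j, off, pair_bit(f[i], f[j], off));
        }
    return 0;
}
```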
Citations: 0
ISQED 2023 Organizing Committee
Pub Date : 2023-04-05 DOI: 10.1109/isqed57927.2023.10129289
{"title":"ISQED 2023 Organizing Committee","authors":"","doi":"10.1109/isqed57927.2023.10129289","DOIUrl":"https://doi.org/10.1109/isqed57927.2023.10129289","url":null,"abstract":"","PeriodicalId":315053,"journal":{"name":"2023 24th International Symposium on Quality Electronic Design (ISQED)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122211515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
NetViz: A Tool for Netlist Security Visualization
Pub Date : 2023-04-05 DOI: 10.1109/ISQED57927.2023.10129374
James Geist, Travis Meade, Shaojie Zhang, Yier Jin
Algorithmic analysis of gate level netlists has become an important technique in hardware security. Algorithms can help detect malicious hardware injected into a design, or lock a design against reverse engineering or malicious modification. Many analysis tools have come from the research and commercial communities; however, it is currently the job of the analyst to make these tools work together and interpret the results. Typically tools are text-based, and require error-prone editing of input files in different formats. The analyst must interpret textual results, and sometimes transform them into other formats for use in third party visualization tools. These tasks are repetitive overhead that take time and effort that could better be spent on investigating the netlist. In this paper we introduce NetViz, a visual hardware security environment. NetViz is a meta-tool which combines other analysis tools, automates the task of transferring data between them, and helps with interpretation of results by providing graphical representations of the data.
Citations: 0