ACM Transactions on Design Automation of Electronic Systems: Latest Articles, Page 6

EPHA: An Energy-efficient Parallel Hybrid Architecture for ANNs and SNNs

IF 1.4, Zone 4 (Computer Science), Q2 Computer Science

ACM Transactions on Design Automation of Electronic Systems

Pub Date : 2024-01-25 DOI: 10.1145/3643134

Yunping Zhao, Sheng Ma, Hengzhu Liu, Libo Huang, Libo Huang

Artificial neural networks (ANNs) and spiking neural networks (SNNs) are two general approaches to achieving artificial intelligence (AI). The former have been widely used in academia and industry. The latter are more similar to biological neural networks and can realize ultra-low power consumption, and have thus received widespread research attention. However, due to fundamental differences in their computation formulas and information coding, the two methods often require different and incompatible platforms. As AI develops, a general platform that can support both ANNs and SNNs becomes necessary. Moreover, there are some similarities between ANNs and SNNs, which leaves room to deploy different networks on the same architecture. However, there is little related research on this topic. Accordingly, this paper presents an energy-efficient, scalable, non-von Neumann architecture (EPHA) for ANNs and SNNs. Our study combines device-, circuit-, architecture-, and algorithm-level innovations to achieve a parallel architecture with ultra-low power consumption. We use compensated ferrimagnets as both synapses and neurons, to store weights and perform dot-product operations, respectively. Moreover, we propose a novel computing flow to reduce the operations across multiple crossbar arrays, which enables our design to handle large and complex tasks. On a suite of ANN and SNN workloads, EPHA is 1.6× more power efficient than a state-of-the-art design, NEBULA, in the ANN mode. In the SNN mode, our design is four orders of magnitude more power efficient than Loihi.
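
The similarity the abstract points to can be illustrated with a small sketch: both network types reduce to the same weighted sum over a weight array, and differ in what happens to that sum afterwards. The code below is a minimal software model under stated assumptions, not the EPHA design itself: the function names, the ReLU activation, the integrate-and-fire rule, and the threshold value are all illustrative choices that do not come from the abstract.

```python
def crossbar_dot(weights, inputs):
    """Weighted sum per output column: out[j] = sum_i inputs[i] * weights[i][j].

    This is the dot-product operation a crossbar array is commonly used for.
    """
    n_out = len(weights[0])
    return [sum(x * row[j] for x, row in zip(inputs, weights)) for j in range(n_out)]


def ann_layer(weights, activations):
    """ANN mode (assumed ReLU): apply a nonlinearity to the weighted sums."""
    return [max(s, 0.0) for s in crossbar_dot(weights, activations)]


def snn_layer_step(weights, spikes, membrane, threshold=1.0):
    """SNN mode (assumed integrate-and-fire): accumulate the weighted sums
    into membrane potentials, emit a spike where the threshold is reached,
    and reset the neurons that fired."""
    sums = crossbar_dot(weights, spikes)
    membrane = [m + s for m, s in zip(membrane, sums)]
    fired = [1.0 if m >= threshold else 0.0 for m in membrane]
    membrane = [0.0 if f else m for m, f in zip(membrane, fired)]
    return fired, membrane


# Hypothetical 2x2 weight array shared by both modes.
W = [[0.5, -1.0],
     [0.75, 0.25]]
ann_out = ann_layer(W, [1.0, 1.0])
snn_out, snn_state = snn_layer_step(W, [1.0, 1.0], [0.0, 0.0])
```

Both modes call the same `crossbar_dot`, which is the point of the sketch: only the post-processing of the weighted sum changes between the two.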

Citations: 0

Pareto Optimization of Analog circuits using Reinforcement Learning

IF 1.4, Zone 4 (Computer Science), Q2 Computer Science

ACM Transactions on Design Automation of Electronic Systems

Pub Date : 2024-01-17 DOI: 10.1145/3640463

Karthik Somayaji Ns, Peng Li

Analog circuit optimization and design present a unique set of challenges in the IC design process. Many applications require the designer to optimize for multiple competing objectives, which poses a crucial challenge. Motivated by these practical aspects, we propose a novel method to tackle multi-objective optimization for analog circuit design in continuous action spaces. In particular, we propose to: (i) extrapolate current techniques in Multi-Objective Reinforcement Learning (MORL) to continuous state and action spaces; and (ii) provide a dynamically tunable trained model that can be queried with user-defined preferences for multi-objective optimization in the analog circuit design context.
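
The idea of querying a set of trade-offs with a user-defined preference can be sketched with two standard building blocks: a Pareto filter and a linear scalarization of the objectives. This is a generic illustration, not the method of the paper; the candidate design points, the objective meanings, and the preference weights below are all hypothetical, and both objectives are assumed to be maximized.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly
    better on at least one (all objectives are maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]


def scalarize(objectives, preference):
    """Collapse an objective vector to one score using preference weights."""
    return sum(w * o for w, o in zip(preference, objectives))


def best_for_preference(points, preference):
    """Pick the Pareto-optimal point that scores highest for this preference."""
    return max(pareto_front(points), key=lambda p: scalarize(p, preference))


# Hypothetical designs scored on two normalized objectives, e.g. (gain, bandwidth).
designs = [(1.0, 5.0), (2.0, 4.0), (1.5, 3.0), (3.0, 1.0)]
front = pareto_front(designs)
favor_second = best_for_preference(designs, (0.25, 0.75))
favor_first = best_for_preference(designs, (1.0, 0.0))
```

Changing the preference vector moves the selected design along the front without recomputing the candidates, which is the behaviour a preference-conditioned model is meant to provide.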

Citations: 0

Detecting Adversarial Examples Utilizing Pixel Value Diversity

IF 1.4, Zone 4 (Computer Science), Q2 Computer Science

ACM Transactions on Design Automation of Electronic Systems

Pub Date : 2024-01-16 DOI: 10.1145/3636460

Jinxin Dong, Pingqiang Zhou

In this paper, we introduce two novel methods to detect adversarial examples utilizing pixel value diversity. First, we propose the concept of pixel value diversity (which reflects the spread of pixel values in an image) and two independent metrics (UPVR and RPVR) that each assess it separately. Then we propose two methods to detect adversarial examples, based on a threshold method and a Bayesian method, respectively. Experimental results show that, compared with a strong prior method, LID, our proposed methods achieve better performance in detecting adversarial examples. We also show the robustness of our proposed work against an adaptive attack method.
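
A threshold detector of the kind described can be sketched as follows. The abstract does not define UPVR or RPVR, so the metric below (the fraction of possible intensity levels that actually occur in the image) is a hypothetical stand-in chosen only for illustration. The threshold value and the assumption that a higher score indicates an adversarial input are likewise assumptions, not results from the paper.

```python
def distinct_value_ratio(pixels, levels=256):
    """Hypothetical diversity score: share of the `levels` possible
    intensity values that appear at least once in the image."""
    return len(set(pixels)) / levels


def flag_by_threshold(pixels, threshold):
    """Flag an image when its diversity score exceeds the threshold
    (the direction of the comparison is an assumption)."""
    return distinct_value_ratio(pixels) > threshold


# Two tiny flattened grayscale patches, invented for the example.
smooth_patch = [10, 10, 11, 11, 12, 12, 10, 11]
noisy_patch = [10, 9, 11, 13, 12, 14, 8, 11]
threshold = 5 / 256
smooth_flagged = flag_by_threshold(smooth_patch, threshold)
noisy_flagged = flag_by_threshold(noisy_patch, threshold)
```

In practice the threshold would be calibrated on clean and adversarial examples; a Bayesian variant would replace the hard comparison with a posterior over the two classes.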

Citations: 0

A Module-Level Configuration Methodology for Programmable Camouflaged Logic

IF 1.4, Zone 4 (Computer Science), Q2 Computer Science

ACM Transactions on Design Automation of Electronic Systems

Pub Date : 2024-01-12 DOI: 10.1145/3640462

Jianfeng Wang, Zhonghao Chen, Jiahao Zhang, Yixin Xu, Tongguang Yu, Ziheng Zheng, Enze Ye, Sumitha George, Huazhong Yang, Yongpan Liu, Kai Ni, Vijaykrishnan Narayanan, Xueqing Li

Logic camouflage is a widely adopted technique that mitigates the threat of intellectual property (IP) piracy and overproduction in the integrated circuit (IC) supply chain. Camouflaged logic achieves functional obfuscation through physical-level ambiguity and post-manufacturing programmability. However, discussions on programmability are confined to the level of logic cells/gates, limiting the broader-scale application of logic camouflage. In this work, we propose a novel module-level configuration methodology for programmable camouflaged logic that can be implemented without additional hardware ports and with negligible resources. We prove theoretically that the configuration of the programmable camouflaged logic cells can be achieved through the inputs and netlist of the original module. Further, we propose a novel lightweight ferroelectric FET (FeFET)-based reconfigurable logic gate (rGate) family and apply it to the proposed methodology. With the flexible replacement and the proposed configuration-aware conversion algorithm, this work is characterized by the input-only programming scheme as well as the combination of high output error rate and point-function-like defense. Evaluations show an average of >95% of the alternative rGate location for camouflage, which is sufficient for the security-aware design. We illustrate the exponential complexity in function state traversal and the enhanced defense capability of locked blackbox against SAT attacks compared to key-based methods. We also preserve an evident output Hamming distance and introduce negligible hardware overheads in both gate-level and module-level evaluations under typical benchmarks.

Citations: 0

RGMU: A High-flexibility and Low-cost Reconfigurable Galois Field Multiplication Unit Design Approach for CGRCA

IF 1.4, Zone 4 (Computer Science), Q2 Computer Science

ACM Transactions on Design Automation of Electronic Systems

Pub Date : 2024-01-09 DOI: 10.1145/3639820

Danping Jiang, Zibin Dai, Yanjiang Liu, Zongren Zhang

Finite field multiplication is a non-linear transformation operator that appears in the majority of symmetric cryptographic algorithms. Numerous specialized finite field multiplication units have been proposed as fundamental modules in the coarse-grained reconfigurable cipher logic array (CGRCA) to support more cryptographic algorithms; however, they bring low flexibility and high overhead, which reduces the performance of the array. In this paper, a high-flexibility and low-cost reconfigurable Galois field multiplication unit, termed RGMU, is proposed to balance the trade-offs between function, delay, and area. All the finite field multiplication operations, including maximum distance separable matrix multiplication, parallel update of the Fibonacci linear feedback shift register, parallel update of the Galois linear feedback shift register, and composite field multiplication, are analyzed, and two basic operation components are abstracted. Further, a reconfigurable finite field multiplication computational model is established to demonstrate the efficacy of reconfigurable units and to guide the design of a high-performance RGMU. Finally, the overall architecture of RGMU and two multiplication circuits are introduced. Experimental results show that the RGMU not only reduces hardware overhead and power consumption but also has the unique advantage of supporting all the finite field multiplication operations in symmetric cryptography algorithms.
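
For readers unfamiliar with the operation being accelerated, here is a minimal software sketch of multiplication in GF(2^8) using the shift-and-XOR method. It is a reference model of the arithmetic only and says nothing about the RGMU circuit. The default reduction polynomial 0x11B is the one used by AES and is an assumption chosen for the example; a reconfigurable unit would support other polynomials and field widths.

```python
def gf_mul(a, b, poly=0x11B, width=8):
    """Multiply a and b in GF(2^width), reducing by the polynomial `poly`.

    `poly` includes the leading term, e.g. 0x11B = x^8 + x^4 + x^3 + x + 1.
    """
    result = 0
    high_bit = 1 << width
    while b:
        if b & 1:
            result ^= a   # addition in GF(2^m) is bitwise XOR
        a <<= 1           # multiply a by x
        if a & high_bit:
            a ^= poly     # reduce once the degree reaches `width`
        b >>= 1
    return result


# Worked value from the AES specification (FIPS 197): {57} * {83} = {c1}.
product = gf_mul(0x57, 0x83)
```

The same loop covers the constant multiplications that appear in maximum distance separable matrices, for example multiplying a byte by 0x02 or 0x03 in AES MixColumns.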

Citations: 0

TROP: TRust-aware OPportunistic Routing in NoC with Hardware Trojans

IF 1.4, Zone 4 (Computer Science), Q2 Computer Science

ACM Transactions on Design Automation of Electronic Systems

Pub Date : 2024-01-08 DOI: 10.1145/3639821

Syam Sankar, Ruchika Gupta, John Jose, Sukumar Nandi

Multiple software and hardware intellectual property (IP) components are combined on a single chip to form Multi-Processor Systems-on-Chip (MPSoCs). Due to rigid time-to-market constraints, some of the IPs come from outsourced third parties. Because the supply chain of IP blocks is handled by unreliable third-party vendors, security has become a crucial design concern in MPSoCs. These IPs may be exposed to unwanted practices such as the insertion of malicious circuits called Hardware Trojans (HTs), leading to security threats and attacks, including sensitive data leakage or integrity violations. A Network-on-Chip (NoC) connects the various units of an MPSoC. Since it serves as the interface between these units, it has complete access to all the data flowing through the system. This makes NoC security a paramount design issue. Our research focuses on a threat model where the NoC is infiltrated by multiple HTs that can corrupt packets. Data integrity is verified at the destination’s network interface (NI), and a failed verification triggers re-transmission of the packet. In this paper, we propose an opportunistic trust-aware routing strategy that efficiently avoids HTs while ensuring that packets arrive at their destination unaltered. Experimental results demonstrate the successful movement of packets through opportunistically selected neighbours along a trust-aware path free from HT effects. We also observe a significant reduction in the rate of packet re-transmissions and in latency, at the cost of minimal area and power overhead.
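
The routing idea can be sketched on a 2D mesh: at each hop, consider only the neighbours that move the packet closer to its destination, and among those pick the one with the highest trust value. This is a simplified illustration, not the TROP algorithm. The abstract does not say how trust is computed or updated, so the `trust` table, the default trust of 1.0, and the restriction to minimal paths are all assumptions.

```python
def productive_neighbors(cur, dst):
    """Neighbours on a 2D mesh that reduce the distance to dst."""
    (x, y), (dx, dy) = cur, dst
    hops = []
    if dx != x:
        hops.append((x + (1 if dx > x else -1), y))
    if dy != y:
        hops.append((x, y + (1 if dy > y else -1)))
    return hops


def next_hop(cur, dst, trust):
    """Pick the productive neighbour with the highest trust value.

    Routers missing from `trust` are assumed fully trusted (1.0).
    On a tie, the X-direction neighbour wins because it is listed first.
    """
    return max(productive_neighbors(cur, dst), key=lambda n: trust.get(n, 1.0))


def route(src, dst, trust):
    """Follow next_hop decisions from src until dst is reached."""
    path = [src]
    while path[-1] != dst:
        path.append(next_hop(path[-1], dst, trust))
    return path


# Hypothetical trust table: router (1, 0) is suspected of hosting a Trojan.
suspicious = {(1, 0): 0.2}
avoiding_path = route((0, 0), (2, 1), suspicious)
default_path = route((0, 0), (2, 1), {})
```

With the low trust value in place, the packet takes the Y direction first and bypasses router (1, 0); with an empty table it follows the usual X-then-Y order.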

Citations: 0

Mixed Integer Programming based Placement Refinement by RSMT Model with Movable Pins

IF 1.4, Zone 4 (Computer Science), Q2 Computer Science

ACM Transactions on Design Automation of Electronic Systems

Pub Date : 2024-01-03 DOI: 10.1145/3639365

Ke Tang, Lang Feng, Zhongfeng Wang

Placement is a critical step in the physical design of digital application-specific integrated circuits (ASICs), as it directly affects design qualities such as wirelength and timing. For many domain-specific designs, the demand for high-performance parallel computing results in repetitive hardware instances, such as the processing elements in neural network accelerators. As these instances can dominate the area of the design, runtime for placing the complete design can be traded for optimizing and reusing one instance’s placement to achieve higher quality. Therefore, this work proposes a mixed integer programming (MIP)-based placement refinement algorithm for the repetitive instances. By efficiently modeling the rectilinear Steiner tree wirelength, the placement can be precisely refined for better quality. In addition, MIP formulations for timing-driven placement are proposed. A theoretical proof is then provided to show the correctness of the proposed wirelength model. For instances from various popular fields, the experiments show that, given the placement from commercial placers, the proposed algorithm can perform further placement refinement to reduce detailed routing wirelength by 3.76%/3.64% and critical path delay by 1.68%/2.42% under wirelength-/timing-driven mode, respectively, and also outperforms the state-of-the-art previous work.
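
To make the wirelength objective concrete, the sketch below computes two standard quantities for a single net: the half-perimeter wirelength (HPWL) and the exact rectilinear Steiner tree length for a three-pin net, where the optimal Steiner point sits at the median x and median y of the pins. This is textbook geometry, not the MIP model from the paper, and the pin coordinates are invented for the example.

```python
def hpwl(pins):
    """Half-perimeter wirelength: width plus height of the pins' bounding box."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def rsmt_3pin(pins):
    """Rectilinear Steiner minimum tree length for exactly three pins.

    The Steiner point is placed at (median x, median y); the tree length is
    the sum of Manhattan distances from each pin to that point.
    """
    if len(pins) != 3:
        raise ValueError("this closed form only covers three-pin nets")
    sx = sorted(x for x, _ in pins)[1]
    sy = sorted(y for _, y in pins)[1]
    return sum(abs(x - sx) + abs(y - sy) for x, y in pins)


# Hypothetical net before and after one movable pin is shifted.
before = [(0, 0), (4, 1), (2, 5)]
after = [(0, 0), (4, 1), (2, 3)]
saving = rsmt_3pin(before) - rsmt_3pin(after)
```

For nets with two or three pins the two measures coincide, which the example confirms. For larger nets HPWL is only a lower bound on the Steiner length, which is one reason a more precise wirelength model can matter during refinement.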

Citations: 0

Application-Level Validation of Accelerator Designs Using a Formal Software/Hardware Interface

IF 1.4, Zone 4 (Computer Science), Q2 Computer Science

ACM Transactions on Design Automation of Electronic Systems

Pub Date : 2023-12-29 DOI: 10.1145/3639051

Bo-Yuan Huang, Steven Lyubomirsky, Yi Li, Mike He, Gus Henry Smith, Thierry Tambe, Akash Gaonkar, Vishal Canumalla, Andrew Cheung, Gu-Yeon Wei, Aarti Gupta, Zachary Tatlock, Sharad Malik

Ideally, accelerator development should be as easy as software development. Several recent design languages/tools are working toward this goal, but actually testing early designs on real applications end-to-end remains prohibitively difficult due to the costs of building specialized compiler and simulator support. We propose a new first-in-class, mostly automated methodology termed “3LA” to enable end-to-end testing of prototype accelerator designs on unmodified source applications. A key contribution of 3LA is the use of a formal software/hardware interface that specifies an accelerator’s operations and their semantics. Specifically, we leverage the Instruction-Level Abstraction (ILA) formal specification for accelerators that has been successfully used thus far for accelerator implementation verification. We show how the ILA for accelerators serves as a software/hardware interface, similar to the Instruction Set Architecture (ISA) for processors, that can be used for automated development of compilers and instruction-level simulators. Another key contribution of this work is to show how ILA-based accelerator semantics enables extending recent work on equality saturation to auto-generate basic compiler support for prototype accelerators in a technique we term “flexible matching.” By combining flexible matching with simulators auto-generated from ILA specifications, our approach enables end-to-end evaluation with modest engineering effort. We detail several case studies of 3LA, which uncovered an unknown flaw in a recently published accelerator and facilitated its fix.

Citations: 0

DeepFlow: A Cross-Stack Pathfinding Framework for Distributed AI Systems

IF 1.4, Zone 4 (Computer Science), Q2 Computer Science

ACM Transactions on Design Automation of Electronic Systems

Pub Date : 2023-12-21 DOI: 10.1145/3635867

Newsha Ardalani, Saptadeep Pal, Puneet Gupta

Over the past decade, machine learning model complexity has grown at an extraordinary rate, as has the scale of the systems training such large models. However, hardware utilization in large-scale AI systems is alarmingly low (5-20%). This low utilization is a cumulative effect of minor losses across different layers of the stack, exacerbated by the disconnect between engineers who design different layers and who span different industries. To address this challenge, we designed a cross-stack performance modelling and design space exploration framework. First, we introduce CrossFlow, a novel framework that enables cross-layer analysis all the way from the technology layer to the algorithmic layer. Next, we introduce DeepFlow (built on top of CrossFlow using machine learning techniques) to automate design space exploration and co-optimization across different layers of the stack. We have validated CrossFlow’s accuracy with distributed training on real commercial hardware, and we showcase several DeepFlow case studies demonstrating the pitfalls of not optimizing across the technology-hardware-software stack for what is likely the most important workload driving large development investments in all aspects of the computing stack.
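
A short calculation shows how small per-layer losses can compound into utilization in the quoted 5-20% range. The model below assumes the losses are independent and multiply, and the layer names and efficiency values are invented for illustration; neither comes from the abstract or from CrossFlow.

```python
from math import prod


def overall_utilization(layer_efficiencies):
    """Overall utilization when each layer keeps only a fraction of the
    performance handed to it (independent, multiplicative losses assumed)."""
    return prod(layer_efficiencies)


# Hypothetical stack: each layer individually looks acceptable at 70%.
stack = {
    "algorithm": 0.7,
    "framework": 0.7,
    "network": 0.7,
    "microarchitecture": 0.7,
    "technology": 0.7,
}
utilization = overall_utilization(stack.values())
print(f"overall utilization: {utilization:.1%}")
```

Five layers at 70% each leave roughly 17% end to end, so no single layer looks alarming in isolation even though the overall figure does. That is the motivation for analysing the stack as a whole.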



Citations: 0

Enhanced Real-time Scheduling of AVB Flows in Time-Sensitive Networking

IF 1.4 Zone 4 Computer Science Q2 Computer Science

ACM Transactions on Design Automation of Electronic Systems

Pub Date : 2023-12-18 DOI: 10.1145/3637878

Libing Deng, Gang Zeng, Ryo Kurachi, Hiroaki Takada, Xiongren Xiao, Renfa Li, Guoqi Xie

Time-Sensitive Networking (TSN) provides high bandwidth and deterministic timing for data transmission and has thus become a crucial communication technology in time-critical systems. The Gate Control List (GCL) controls the transmission of different classes of traffic in TSN, including Time-Triggered (TT) flows, Audio-Video-Bridging (AVB) flows, and Best-Effort (BE) flows. Most studies focus on optimizing GCL synthesis by reserving the preceding time slots for TT flows, which have strict delay requirements, but they ignore the deadlines of non-TT flows and therefore cause large delays for them. This paper therefore proposes a comprehensive scheduling method that enhances the real-time scheduling of AVB flows while guaranteeing the time determinism of TT flows. The method first optimizes GCL synthesis to reserve the preceding time slots for AVB flows, and then introduces the Earliest Deadline First (EDF) method to further improve the transmission of AVB flows by considering their deadlines. Moreover, a worst-case delay (WCD) analysis method is proposed to verify the effectiveness of the proposed method. Experimental results show that the proposed method improves the transmission of AVB flows compared to state-of-the-art methods.
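The EDF idea mentioned in the abstract can be sketched generically as follows. This is a textbook Earliest Deadline First loop, not the paper's scheduling method: it assumes a simplified model in which time is divided into equal slots, each frame occupies exactly one slot, and slots reserved for TT traffic are given as a fixed set. The names `Frame`, `edf_schedule`, and `tt_slots` are hypothetical.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Frame:
    deadline: int                        # frame must be sent in a slot < deadline
    name: str = field(compare=False)
    release: int = field(compare=False)  # first slot in which the frame is ready


def edf_schedule(frames, tt_slots, horizon):
    """Send one ready frame per free slot, earliest deadline first.

    tt_slots: slots reserved for TT traffic (closed to AVB in this toy model).
    Returns (slot -> frame name, names of frames that missed their deadline).
    """
    pending = sorted(frames, key=lambda f: f.release)
    ready, schedule, missed = [], {}, []
    i = 0
    for slot in range(horizon):
        # Move frames that have been released by this slot into the ready heap.
        while i < len(pending) and pending[i].release <= slot:
            heapq.heappush(ready, pending[i])
            i += 1
        if slot in tt_slots or not ready:
            continue
        frame = heapq.heappop(ready)     # smallest deadline comes out first
        schedule[slot] = frame.name
        if slot >= frame.deadline:
            missed.append(frame.name)
    # Frames never transmitted within the horizon also count as missed.
    missed.extend(f.name for f in ready)
    missed.extend(f.name for f in pending[i:])
    return schedule, missed


frames = [Frame(4, "A", 0), Frame(2, "B", 0), Frame(3, "C", 1)]
schedule, missed = edf_schedule(frames, tt_slots={0}, horizon=6)
print(schedule, missed)  # {1: 'B', 2: 'C', 3: 'A'} []
```

In this toy run, slot 0 is reserved for TT traffic, so the AVB frames are served from slot 1 onward in deadline order (B, then C, then A), and all three meet their deadlines. A real TSN schedule would also have to account for frame lengths, guard bands, and per-port gate states, which this sketch leaves out.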



Citations: 0