Artificial neural networks (ANNs) and spiking neural networks (SNNs) are two general approaches to achieve artificial intelligence (AI). The former have been widely used in academia and industry fields. The latter, SNNs, are more similar to biological neural networks and can realize ultra-low power consumption, thus have received widespread research attention. However, due to their fundamental differences in computation formula and information coding, the two methods often require different and incompatible platforms. Alongside the development of AI, a general platform that can support both ANNs and SNNs is necessary. Moreover, there are some similarities between ANNs and SNNs, which leaves room to deploy different networks on the same architecture. However, there is little related research on this topic. Accordingly, this paper presents an energy-efficient, scalable, and non-Von Neumann architecture (EPHA) for ANNs and SNNs. Our study combines device-, circuit-, architecture-, and algorithm-level innovations to achieve a parallel architecture with ultra-low power consumption. We use the compensated ferrimagnet to act as both synapses and neurons to store weights and perform dot-product operations, respectively. Moreover, we propose a novel computing flow to reduce the operations across multiple crossbar arrays, which enables our design to conduct large and complex tasks. On a suite of ANN and SNN workloads, the EPHA is 1.6 × more power efficient than a state-of-the-art design, NEBULA, in the ANN mode. In the SNN mode, our design is 4 orders of magnitude more than the Loihi in power efficiency.
{"title":"EPHA: An Energy-efficient Parallel Hybrid Architecture for ANNs and SNNs","authors":"Yunping Zhao, Sheng Ma, Hengzhu Liu, Libo Huang, Libo Huang","doi":"10.1145/3643134","DOIUrl":"https://doi.org/10.1145/3643134","url":null,"abstract":"<p>Artificial neural networks (ANNs) and spiking neural networks (SNNs) are two general approaches to achieve artificial intelligence (AI). The former have been widely used in academia and industry fields. The latter, SNNs, are more similar to biological neural networks and can realize ultra-low power consumption, thus have received widespread research attention. However, due to their fundamental differences in computation formula and information coding, the two methods often require different and incompatible platforms. Alongside the development of AI, a general platform that can support both ANNs and SNNs is necessary. Moreover, there are some similarities between ANNs and SNNs, which leaves room to deploy different networks on the same architecture. However, there is little related research on this topic. Accordingly, this paper presents an energy-efficient, scalable, and non-Von Neumann architecture (EPHA) for ANNs and SNNs. Our study combines device-, circuit-, architecture-, and algorithm-level innovations to achieve a parallel architecture with ultra-low power consumption. We use the compensated ferrimagnet to act as both synapses and neurons to store weights and perform dot-product operations, respectively. Moreover, we propose a novel computing flow to reduce the operations across multiple crossbar arrays, which enables our design to conduct large and complex tasks. On a suite of ANN and SNN workloads, the EPHA is 1.6 × more power efficient than a state-of-the-art design, NEBULA, in the ANN mode. In the SNN mode, our design is 4 orders of magnitude more than the Loihi in power efficiency.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2024-01-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139579076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Analog circuit optimization and design presents a unique set of challenges in the IC design process. Many applications require for the designer to optimize for multiple competing objectives which poses a crucial challenge. Motivated by these practical aspects, we propose a novel method to tackle multi-objective optimization for analog circuit design in continuous action spaces. In particular, we propose to: (i) Extrapolate current techniques in Multi-Objective Reinforcement Learning (MORL) to continuous state and action spaces. (ii) Provide for a dynamically tunable trained model to query user defined preferences in multi-objective optimization in the analog circuit design context.
{"title":"Pareto Optimization of Analog circuits using Reinforcement Learning","authors":"Karthik Somayaji Ns, Peng Li","doi":"10.1145/3640463","DOIUrl":"https://doi.org/10.1145/3640463","url":null,"abstract":"<p>Analog circuit optimization and design presents a unique set of challenges in the IC design process. Many applications require for the designer to optimize for multiple competing objectives which poses a crucial challenge. Motivated by these practical aspects, we propose a novel method to tackle multi-objective optimization for analog circuit design in continuous action spaces. In particular, we propose to: (i) Extrapolate current techniques in Multi-Objective Reinforcement Learning (MORL) to continuous state and action spaces. (ii) Provide for a dynamically tunable trained model to query user defined preferences in multi-objective optimization in the analog circuit design context.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2024-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139482630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we introduce two novel methods to detect adversarial examples utilizing pixel value diversity. First, we propose the concept of pixel value diversity (which reflects the spread of pixel values in an image) and two independent metrics (UPVR and RPVR) to assess the pixel value diversity separately. Then we propose two methods to detect adversarial examples based on the threshold method and Bayesian method respectively. Experimental results show that compared to an excellent prior method LID, our proposed methods achieve better performances in detecting adversarial examples. We also show the robustness of our proposed work against an adaptive attack method.
在本文中,我们介绍了两种利用像素值多样性检测对抗示例的新方法。首先,我们提出了像素值多样性的概念(它反映了图像中像素值的分布)和两个独立的指标(UPVR 和 RPVR)来分别评估像素值多样性。然后,我们分别提出了基于阈值法和贝叶斯法的两种检测对抗示例的方法。实验结果表明,与优秀的先验方法 LID 相比,我们提出的方法在检测对抗性示例方面取得了更好的性能。我们还展示了我们提出的方法对自适应攻击方法的鲁棒性。
{"title":"Detecting Adversarial Examples Utilizing Pixel Value Diversity","authors":"Jinxin Dong, Pingqiang Zhou","doi":"10.1145/3636460","DOIUrl":"https://doi.org/10.1145/3636460","url":null,"abstract":"<p>In this paper, we introduce two novel methods to detect adversarial examples utilizing pixel value diversity. First, we propose the concept of pixel value diversity (which reflects the spread of pixel values in an image) and two independent metrics (UPVR and RPVR) to assess the pixel value diversity separately. Then we propose two methods to detect adversarial examples based on the threshold method and Bayesian method respectively. Experimental results show that compared to an excellent prior method LID, our proposed methods achieve better performances in detecting adversarial examples. We also show the robustness of our proposed work against an adaptive attack method.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2024-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139476939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jianfeng Wang, Zhonghao Chen, Jiahao Zhang, Yixin Xu, Tongguang Yu, Ziheng Zheng, Enze Ye, Sumitha George, Huazhong Yang, Yongpan Liu, Kai Ni, Vijaykrishnan Narayanan, Xueqing Li
Logic camouflage is a widely adopted technique that mitigates the threat of intellectual property (IP) piracy and overproduction in the integrated circuit (IC) supply chain. Camouflaged logic achieves functional obfuscation through physical-level ambiguity and post-manufacturing programmability. However, discussions on programmability are confined to the level of logic cells/gates, limiting the broader-scale application of logic camouflage. In this work, we propose a novel module-level configuration methodology for programmable camouflaged logic that can be implemented without additional hardware ports and with negligible resources. We prove theoretically that the configuration of the programmable camouflaged logic cells can be achieved through the inputs and netlist of the original module. Further, we propose a novel lightweight ferroelectric FET (FeFET)-based reconfigurable logic gate (rGate) family and apply it to the proposed methodology. With the flexible replacement and the proposed configuration-aware conversion algorithm, this work is characterized by the input-only programming scheme as well as the combination of high output error rate and point-function-like defense. Evaluations show an average of >95% of the alternative rGate location for camouflage, which is sufficient for the security-aware design. We illustrate the exponential complexity in function state traversal and the enhanced defense capability of locked blackbox against SAT attacks compared to key-based methods. We also preserve an evident output Hamming distance and introduce negligible hardware overheads in both gate-level and module-level evaluations under typical benchmarks.
逻辑伪装是一种广泛采用的技术,可减轻集成电路(IC)供应链中知识产权(IP)盗版和过度生产的威胁。伪装逻辑通过物理层面的模糊性和制造后的可编程性实现功能混淆。然而,关于可编程性的讨论仅限于逻辑单元/门的层面,限制了逻辑伪装在更大范围内的应用。在这项工作中,我们为可编程伪装逻辑提出了一种新颖的模块级配置方法,无需额外的硬件端口和可忽略的资源即可实现。我们从理论上证明,可编程伪装逻辑单元的配置可以通过原始模块的输入和网表来实现。此外,我们还提出了一种基于铁电场效应晶体管(FeFET)的新型轻量级可重构逻辑门(rGate)系列,并将其应用于所提出的方法。通过灵活的替换和所提出的配置感知转换算法,这项工作的特点是只需输入编程方案以及高输出错误率和点函数式防御的结合。评估显示,伪装的可选 rGate 位置平均达到 95%,足以满足安全感知设计的要求。我们说明了函数状态遍历的指数复杂性,以及与基于密钥的方法相比,锁定黑盒对 SAT 攻击的增强防御能力。我们还保留了明显的输出汉明距离,并在典型基准下的门级和模块级评估中引入了可忽略不计的硬件开销。
{"title":"A Module-Level Configuration Methodology for Programmable Camouflaged Logic","authors":"Jianfeng Wang, Zhonghao Chen, Jiahao Zhang, Yixin Xu, Tongguang Yu, Ziheng Zheng, Enze Ye, Sumitha George, Huazhong Yang, Yongpan Liu, Kai Ni, Vijaykrishnan Narayanan, Xueqing Li","doi":"10.1145/3640462","DOIUrl":"https://doi.org/10.1145/3640462","url":null,"abstract":"<p>Logic camouflage is a widely adopted technique that mitigates the threat of intellectual property (IP) piracy and overproduction in the integrated circuit (IC) supply chain. Camouflaged logic achieves functional obfuscation through physical-level ambiguity and post-manufacturing programmability. However, discussions on programmability are confined to the level of logic cells/gates, limiting the broader-scale application of logic camouflage. In this work, we propose a novel module-level configuration methodology for programmable camouflaged logic that can be implemented without additional hardware ports and with negligible resources. We prove theoretically that the configuration of the programmable camouflaged logic cells can be achieved through the inputs and netlist of the original module. Further, we propose a novel lightweight ferroelectric FET (FeFET)-based reconfigurable logic gate (rGate) family and apply it to the proposed methodology. With the flexible replacement and the proposed configuration-aware conversion algorithm, this work is characterized by the input-only programming scheme as well as the combination of high output error rate and point-function-like defense. Evaluations show an average of >95% of the alternative rGate location for camouflage, which is sufficient for the security-aware design. We illustrate the exponential complexity in function state traversal and the enhanced defense capability of locked blackbox against SAT attacks compared to key-based methods. We also preserve an evident output Hamming distance and introduce negligible hardware overheads in both gate-level and module-level evaluations under typical benchmarks.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2024-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139458921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Finite field multiplication is a non-linear transformation operator that appears in the majority of symmetric cryptographic algorithms. Numerous specified finite field multiplication units have been proposed as a fundamental module in the coarse-grained reconfigurable cipher logic array to support more cryptographic algorithms, however, it will introduce low flexibility and high overhead, resulting in reduced performance of the coarse-grained reconfigurable cipher logic array. In this paper, a high-flexibility and low-cost reconfigurable Galois field multiplication unit, which is termed as RGMU, is proposed to balance the trade-offs between the function, delay, and area. All the finite field multiplication operations, including maximum distance separable matrix multiplication, parallel update of Fibonacci linear feedback shift register, parallel update of Galois linear feedback shift register, and composite field multiplication, are analyzed and two basic operation components are abstracted. Further, a reconfigurable finite field multiplication computational model is established to demonstrate the efficacy of reconfigurable units and guide the design of RGMU with high performance. Finally, the overall architecture of RGMU and two multiplication circuits are introduced. Experimental results show that the RGMU can not only reduce the hardware overhead and power consumption but also has the unique advantage of satisfying all the finite field multiplication operations in symmetric cryptography algorithms.
{"title":"RGMU: A High-flexibility and Low-cost Reconfigurable Galois Field Multiplication Unit Design Approach for CGRCA","authors":"Danping Jiang, Zibin Dai, Yanjiang Liu, Zongren Zhang","doi":"10.1145/3639820","DOIUrl":"https://doi.org/10.1145/3639820","url":null,"abstract":"<p>Finite field multiplication is a non-linear transformation operator that appears in the majority of symmetric cryptographic algorithms. Numerous specified finite field multiplication units have been proposed as a fundamental module in the coarse-grained reconfigurable cipher logic array to support more cryptographic algorithms, however, it will introduce low flexibility and high overhead, resulting in reduced performance of the coarse-grained reconfigurable cipher logic array. In this paper, a high-flexibility and low-cost reconfigurable Galois field multiplication unit, which is termed as RGMU, is proposed to balance the trade-offs between the function, delay, and area. All the finite field multiplication operations, including maximum distance separable matrix multiplication, parallel update of Fibonacci linear feedback shift register, parallel update of Galois linear feedback shift register, and composite field multiplication, are analyzed and two basic operation components are abstracted. Further, a reconfigurable finite field multiplication computational model is established to demonstrate the efficacy of reconfigurable units and guide the design of RGMU with high performance. Finally, the overall architecture of RGMU and two multiplication circuits are introduced. Experimental results show that the RGMU can not only reduce the hardware overhead and power consumption but also has the unique advantage of satisfying all the finite field multiplication operations in symmetric cryptography algorithms.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2024-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139412945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Syam Sankar, Ruchika Gupta, John Jose, Sukumar Nandi
Multiple software and hardware intellectual property (IP) components are combined on a single chip to form Multi-Processor Systems-on-Chips (MPSoCs). Due to the rigid time-to-market constraints, some of the IPs are from outsourced third parties. Due to the supply-chain management of IP blocks being handled by unreliable third-party vendors, security has grown as a crucial design concern in the MPSoC. These IPs may get exposed to certain unwanted practises like the insertion of malicious circuits called Hardware Trojan (HT) leading to security threats and attacks, including sensitive data leakage or integrity violations. A Network-on-Chip (NoC) connects various units of an MPSoC. Since it serves as the interface between various units in an MPSoC, it has complete access to all the data flowing through the system. This makes NoC security a paramount design issue. Our research focuses on a threat model where the NoC is infiltrated by multiple HTs that can corrupt packets. Data integrity verified at the destination’s network interface (NI) triggers re-transmissions of packets if the verification results in an error. In this paper, we propose an opportunistic trust-aware routing strategy that efficiently avoids HT while ensuring that the packets arrive at their destination unaltered. Experimental results demonstrate the successful movement of packets through opportunistically selected neighbours along a trust-aware path free from the HT effect. We also observe a significant reduction in the rate of packet re-transmissions and latency at the expense of incurring minimum area and power overhead.
多个软件和硬件知识产权(IP)组件在单个芯片上组合成多处理器片上系统(MPSoC)。由于严格的上市时间限制,部分 IP 来自外包的第三方。由于 IP 块的供应链管理由不可靠的第三方供应商负责,安全性已成为 MPSoC 设计中的一个关键问题。这些 IP 可能会暴露在某些不必要的行为中,如插入被称为硬件木马(HT)的恶意电路,从而导致安全威胁和攻击,包括敏感数据泄漏或完整性违规。片上网络(NoC)连接 MPSoC 的各个单元。由于 NoC 是 MPSoC 中各个单元之间的接口,因此可以完全访问流经系统的所有数据。因此,NoC 的安全性是一个至关重要的设计问题。我们的研究重点是 NoC 被多个 HT 入侵的威胁模型,这些 HT 可以破坏数据包。在目的地网络接口(NI)验证数据完整性时,如果验证结果出错,就会触发数据包的重新传输。在本文中,我们提出了一种机会主义信任感知路由策略,它能有效地避免 HT,同时确保数据包在到达目的地时未被更改。实验结果表明,数据包能成功地沿着一条无 HT 影响的信任感知路径,通过机会选择的邻居移动。我们还观察到,数据包重传率和延迟显著降低,而产生的面积和功耗开销却最小。
{"title":"TROP: TRust-aware OPportunistic Routing in NoC with Hardware Trojans","authors":"Syam Sankar, Ruchika Gupta, John Jose, Sukumar Nandi","doi":"10.1145/3639821","DOIUrl":"https://doi.org/10.1145/3639821","url":null,"abstract":"<p> Multiple software and hardware intellectual property (IP) components are combined on a single chip to form Multi-Processor Systems-on-Chips (MPSoCs). Due to the rigid time-to-market constraints, some of the IPs are from outsourced third parties. Due to the supply-chain management of IP blocks being handled by unreliable third-party vendors, security has grown as a crucial design concern in the MPSoC. These IPs may get exposed to certain unwanted practises like the insertion of malicious circuits called Hardware Trojan (HT) leading to security threats and attacks, including sensitive data leakage or integrity violations. A Network-on-Chip (NoC) connects various units of an MPSoC. Since it serves as the interface between various units in an MPSoC, it has complete access to all the data flowing through the system. This makes NoC security a paramount design issue. Our research focuses on a threat model where the NoC is infiltrated by multiple HTs that can corrupt packets. Data integrity verified at the destination’s network interface (NI) triggers re-transmissions of packets if the verification results in an error. In this paper, we propose an opportunistic trust-aware routing strategy that efficiently avoids HT while ensuring that the packets arrive at their destination unaltered. Experimental results demonstrate the successful movement of packets through opportunistically selected neighbours along a trust-aware path free from the HT effect. We also observe a significant reduction in the rate of packet re-transmissions and latency at the expense of incurring minimum area and power overhead.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2024-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139413076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Placement is a critical step in the physical design for digital application specific integrated circuits (ASICs), as it can directly affect the design qualities such as wirelength and timing. For many domain specific designs, the demands for high performance parallel computing result in repetitive hardware instances, such as the processing elements in the neural network accelerators. As these instances can dominate the area of the designs, the runtime of the complete design’s placement can be traded for optimizing and reusing one instance’s placement to achieve higher quality. Therefore, this work proposes a mixed integer programming (MIP)-based placement refinement algorithm for the repetitive instances. By efficiently modeling the rectilinear steiner tree wirelength, the placement can be precisely refined for better quality. Besides, the MIP formulations for timing-driven placement are proposed. A theoretical proof is then provided to show the correctness of the proposed wirelength model. For the instances in various popular fields, the experiments show that given the placement from the commercial placers, the proposed algorithm can perform further placement refinement to reduce 3.76%/3.64% detailed routing wirelength and 1.68%/2.42% critical path delay under wirelength/timing-driven mode, respectively, and also outperforms the state-of-the-art previous work.
{"title":"Mixed Integer Programming based Placement Refinement by RSMT Model with Movable Pins","authors":"Ke Tang, Lang Feng, Zhongfeng Wang","doi":"10.1145/3639365","DOIUrl":"https://doi.org/10.1145/3639365","url":null,"abstract":"<p>Placement is a critical step in the physical design for digital application specific integrated circuits (ASICs), as it can directly affect the design qualities such as wirelength and timing. For many domain specific designs, the demands for high performance parallel computing result in repetitive hardware instances, such as the processing elements in the neural network accelerators. As these instances can dominate the area of the designs, the runtime of the complete design’s placement can be traded for optimizing and reusing one instance’s placement to achieve higher quality. Therefore, this work proposes a mixed integer programming (MIP)-based placement refinement algorithm for the repetitive instances. By efficiently modeling the rectilinear steiner tree wirelength, the placement can be precisely refined for better quality. Besides, the MIP formulations for timing-driven placement are proposed. A theoretical proof is then provided to show the correctness of the proposed wirelength model. For the instances in various popular fields, the experiments show that given the placement from the commercial placers, the proposed algorithm can perform further placement refinement to reduce 3.76%/3.64% detailed routing wirelength and 1.68%/2.42% critical path delay under wirelength/timing-driven mode, respectively, and also outperforms the state-of-the-art previous work.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2024-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139093245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bo-Yuan Huang, Steven Lyubomirsky, Yi Li, Mike He, Gus Henry Smith, Thierry Tambe, Akash Gaonkar, Vishal Canumalla, Andrew Cheung, Gu-Yeon Wei, Aarti Gupta, Zachary Tatlock, Sharad Malik
Ideally, accelerator development should be as easy as software development. Several recent design languages/tools are working toward this goal, but actually testing early designs on real applications end-to-end remains prohibitively difficult due to the costs of building specialized compiler and simulator support. We propose a new first-in-class, mostly automated methodology termed “3LA” to enable end-to-end testing of prototype accelerator designs on unmodified source applications. A key contribution of 3LA is the use of a formal software/hardware interface that specifies an accelerator’s operations and their semantics. Specifically, we leverage the Instruction-Level Abstraction (ILA) formal specification for accelerators that has been successfully used thus far for accelerator implementation verification. We show how the ILA for accelerators serves as a software/hardware interface, similar to the Instruction Set Architecture (ISA) for processors, that can be used for automated development of compilers and instruction-level simulators. Another key contribution of this work is to show how ILA-based accelerator semantics enables extending recent work on equality saturation to auto-generate basic compiler support for prototype accelerators in a technique we term “flexible matching.” By combining flexible matching with simulators auto-generated from ILA specifications, our approach enables end-to-end evaluation with modest engineering effort. We detail several case studies of 3LA, which uncovered an unknown flaw in a recently published accelerator and facilitated its fix.
理想情况下,加速器开发应该像软件开发一样简单。最近有几种设计语言/工具正在努力实现这一目标,但由于构建专用编译器和模拟器支持的成本过高,在实际应用中端到端测试早期设计仍然非常困难。我们提出了一种同类首创的全新自动化方法,称为 "3LA",可在未修改的源程序上对加速器原型设计进行端到端测试。3LA 的一个主要贡献是使用正式的软件/硬件接口来指定加速器的操作及其语义。具体来说,我们利用了迄今已成功用于加速器实现验证的加速器指令级抽象(ILA)形式规范。我们展示了加速器的 ILA 如何充当软/硬件接口,类似于处理器的指令集架构 (ISA),可用于编译器和指令级模拟器的自动开发。这项工作的另一个主要贡献是展示了基于 ILA 的加速器语义如何扩展近期的等价饱和工作,从而通过我们称之为 "灵活匹配 "的技术自动生成原型加速器的基本编译器支持。通过将灵活匹配与根据 ILA 规范自动生成的模拟器相结合,我们的方法能够以较小的工程工作量实现端到端的评估。我们详细介绍了 3LA 的几个案例研究,它发现了最近发布的加速器中的一个未知缺陷,并帮助修复了该缺陷。
{"title":"Application-Level Validation of Accelerator Designs Using a Formal Software/Hardware Interface","authors":"Bo-Yuan Huang, Steven Lyubomirsky, Yi Li, Mike He, Gus Henry Smith, Thierry Tambe, Akash Gaonkar, Vishal Canumalla, Andrew Cheung, Gu-Yeon Wei, Aarti Gupta, Zachary Tatlock, Sharad Malik","doi":"10.1145/3639051","DOIUrl":"https://doi.org/10.1145/3639051","url":null,"abstract":"<p>Ideally, accelerator development should be as easy as software development. Several recent design languages/tools are working toward this goal, but actually testing early designs on real applications end-to-end remains prohibitively difficult due to the costs of building specialized compiler and simulator support. We propose a new first-in-class, mostly automated methodology termed “3LA” to enable end-to-end testing of prototype accelerator designs on unmodified source applications. A key contribution of 3LA is the use of a formal software/hardware interface that specifies an accelerator’s operations and their semantics. Specifically, we leverage the Instruction-Level Abstraction (ILA) formal specification for accelerators that has been successfully used thus far for accelerator implementation verification. We show how the ILA for accelerators serves as a software/hardware interface, similar to the Instruction Set Architecture (ISA) for processors, that can be used for automated development of compilers and instruction-level simulators. Another key contribution of this work is to show how ILA-based accelerator semantics enables extending recent work on equality saturation to auto-generate basic compiler support for prototype accelerators in a technique we term “flexible matching.” By combining flexible matching with simulators auto-generated from ILA specifications, our approach enables end-to-end evaluation with modest engineering effort. We detail several case studies of 3LA, which uncovered an unknown flaw in a recently published accelerator and facilitated its fix.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2023-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139064774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Over the past decade, machine learning model complexity has grown at an extraordinary rate, as has the scale of the systems training such large models. However there is an alarmingly low hardware utilization (5-20%) in large scale AI systems. The low system utilization is a cumulative effect of minor losses across different layers of the stack, exacerbated by the disconnect between engineers designing different layers spanning across different industries. To address this challenge, in this work we designed a cross-stack performance modelling and design space exploration framework. First, we introduce CrossFlow, a novel framework that enables cross-layer analysis all the way from the technology layer to the algorithmic layer. Next, we introduce DeepFlow (built on top of CrossFlow using machine learning techniques) to automate the design space exploration and co-optimization across different layers of the stack. We have validated CrossFlow’s accuracy with distributed training on real commercial hardware and showcase several DeepFlow case studies demonstrating pitfalls of not optimizing across the technology-hardware-software stack for what is likely, the most important workload driving large development investments in all aspects of computing stack.
{"title":"DeepFlow: A Cross-Stack Pathfinding Framework for Distributed AI Systems","authors":"Newsha Ardalani, Saptadeep Pal, Puneet Gupta","doi":"10.1145/3635867","DOIUrl":"https://doi.org/10.1145/3635867","url":null,"abstract":"<p>Over the past decade, machine learning model complexity has grown at an extraordinary rate, as has the scale of the systems training such large models. However there is an alarmingly low hardware utilization (5-20%) in large scale AI systems. The low system utilization is a cumulative effect of minor losses across different layers of the stack, exacerbated by the disconnect between engineers designing different layers spanning across different industries. To address this challenge, in this work we designed a cross-stack performance modelling and design space exploration framework. First, we introduce CrossFlow, a novel framework that enables cross-layer analysis all the way from the technology layer to the algorithmic layer. Next, we introduce DeepFlow (built on top of CrossFlow using machine learning techniques) to automate the design space exploration and co-optimization across different layers of the stack. We have validated CrossFlow’s accuracy with distributed training on real commercial hardware and showcase several DeepFlow case studies demonstrating pitfalls of not optimizing across the technology-hardware-software stack for what is likely, the most important workload driving large development investments in all aspects of computing stack.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2023-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138824205","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Time-Sensitive Networking (TSN) realizes high bandwidth and time determinism for data transmission and thus becomes the crucial communication technology in time-critical systems. The Gate Control List (GCL) is used to control the transmission of different classes of traffic in TSN, including Time-Triggered (TT) flows, Audio-Video-Bridging (AVB) flows, and Best-Effort (BE) flows. Most studies focus on optimizing GCL synthesis by reserving the preceding time slots to serve TT flows with the strict delay requirement, but ignore the deadlines of non-TT flows and cause the large delay. Therefore, this paper proposes a comprehensive scheduling method to enhance the real-time scheduling of AVB flows while guaranteeing the time determinism of TT flows. This method first optimizes GCL synthesis to reserve the preceding time slots for AVB flows, and then introduces the Earliest Deadline First (EDF) method to further improve the transmission of AVB flows by considering their deadlines. Moreover, the worst-case delay (WCD) analysis method is proposed to verify the effectiveness of the proposed method. Experimental results show that the proposed method improves the transmission of AVB flows compared to the state-of-the-art methods.
{"title":"Enhanced Real-time Scheduling of AVB Flows in Time-Sensitive Networking","authors":"Libing Deng, Gang Zeng, Ryo Kurachi, Hiroaki Takada, Xiongren Xiao, Renfa Li, Guoqi Xie","doi":"10.1145/3637878","DOIUrl":"https://doi.org/10.1145/3637878","url":null,"abstract":"<p>Time-Sensitive Networking (TSN) realizes high bandwidth and time determinism for data transmission and thus becomes the crucial communication technology in time-critical systems. The Gate Control List (GCL) is used to control the transmission of different classes of traffic in TSN, including Time-Triggered (TT) flows, Audio-Video-Bridging (AVB) flows, and Best-Effort (BE) flows. Most studies focus on optimizing GCL synthesis by reserving the preceding time slots to serve TT flows with the strict delay requirement, but ignore the deadlines of non-TT flows and cause the large delay. Therefore, this paper proposes a comprehensive scheduling method to enhance the real-time scheduling of AVB flows while guaranteeing the time determinism of TT flows. This method first optimizes GCL synthesis to reserve the preceding time slots for AVB flows, and then introduces the Earliest Deadline First (EDF) method to further improve the transmission of AVB flows by considering their deadlines. Moreover, the worst-case delay (WCD) analysis method is proposed to verify the effectiveness of the proposed method. Experimental results show that the proposed method improves the transmission of AVB flows compared to the state-of-the-art methods.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2023-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138717326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}