
ACM Transactions on Design Automation of Electronic Systems: Latest Publications

2D Search Space for Extracting Broadside Tests from Functional Test Sequences
IF 1.4, CAS Tier 4 (Computer Science), Q2 Computer Science, Pub Date: 2024-03-02, DOI: 10.1145/3650207
Irith Pomeranz

Testing for delay faults after chip manufacturing is critical to correct chip operation. Tests for delay faults are applied using scan chains that provide access to internal memory elements. As a result, a circuit may operate under non-functional operation conditions during test application. This may lead to overtesting. The extraction of broadside tests from functional test sequences ensures that the tests create functional operation conditions. When N functional test sequences of length L + 1 are available, the number of broadside tests that can be extracted is N · L. Depending on the source of the functional test sequences, the value of N · L may be large. In this case, it is important to select a subset of nN sequences, and consider only the first lL clock cycles of every sequence for the extraction of n · lN · L broadside tests. The two-dimensional N × L search space for broadside tests is the subject of this article. Using a static procedure that considers fixed values of n and l, the article demonstrates that, for the same value of n · l, different circuits benefit from different values of n and l. It also describes a dynamic procedure that matches the parameters n and l to the circuit. The discussion is supported by experimental results for transition faults in benchmark circuits.
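
As a rough illustration of the search space described above, the following sketch (a hypothetical data layout and function name, not the article's procedure) enumerates the n · l broadside tests obtained by restricting extraction to the first n sequences and the first l clock cycles of each.

```python
# Hypothetical sketch: restricting the N-by-L broadside-test search space to an
# n-by-l corner. Each functional test sequence is a list of L+1 circuit states, and
# every pair of consecutive states yields one candidate broadside test.
def select_broadside_subset(sequences, n, l):
    subset = []
    for seq in sequences[:n]:                      # first n <= N sequences
        for c in range(min(l, len(seq) - 1)):      # first l <= L clock cycles
            subset.append((seq[c], seq[c + 1]))    # (scan-in state, next functional state)
    return subset                                  # n * l broadside tests, far fewer than N * L
```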

Citations: 0
FortiFix: A Fault Attack Aware Compiler Framework for Crypto Implementations
IF 1.4, CAS Tier 4 (Computer Science), Q2 Computer Science, Pub Date: 2024-03-01, DOI: 10.1145/3650029
Keerthi K, Chester Rebeiro

Fault attacks are one of the most powerful forms of cryptanalytic attack on embedded systems: they can corrupt a cipher's operations, leading to a breach of confidentiality and integrity. A single precisely injected fault during the execution of a cipher can be exploited to retrieve the secret key in a few milliseconds. Naïve countermeasures introduced into an implementation can lead to huge overheads, making them unusable in resource-constrained environments. On the other hand, optimized countermeasures require significant knowledge, not just about the attack, but also about (a) the cryptographic properties of the cipher, (b) the program structure, and (c) the underlying hardware architecture. This makes protection against fault attacks tedious and error-prone.

In this paper, we introduce the first automated compiler framework, named FortiFix, that can detect and patch fault-exploitable regions in a block cipher implementation. The framework has two phases. The pre-compilation phase identifies regions in the source code of a block cipher that are vulnerable to fault attacks. The second phase is incorporated as transformation passes in the LLVM compiler to find exploitable instructions, quantify the impact of a fault on these instructions, and finally insert appropriate countermeasures based on user-defined security requirements. As a proof of concept, we have evaluated two block cipher implementations, AES-128 and CLEFIA-128, on three different hardware platforms: MSP430 (16-bit), ARM (32-bit), and RISC-V (32-bit).
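
The abstract does not spell out which countermeasures the transformation passes insert; as a purely conceptual illustration, the sketch below shows the classic duplicate-and-compare (temporal redundancy) countermeasure that fault-tolerant compilers commonly emit, written in Python for brevity rather than at the LLVM IR level where the described framework operates.

```python
# Conceptual duplicate-and-compare countermeasure (illustrative only; not FortiFix's
# actual pass, which transforms LLVM IR based on user-defined security requirements).
def protected_call(op, *args):
    """Execute op twice and compare the results; a mismatch signals an injected fault."""
    first = op(*args)
    second = op(*args)
    if first != second:
        raise RuntimeError("fault detected: redundant computations disagree")
    return first

# Example (hypothetical round function): guard one round of a block cipher.
# new_state = protected_call(encrypt_round, state, round_key)
```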

Citations: 0
Root-Cause Analysis with Semi-Supervised Co-Training for Integrated Systems
IF 1.4, CAS Tier 4 (Computer Science), Q2 Computer Science, Pub Date: 2024-03-01, DOI: 10.1145/3649313
Renjian Pan, Xin Li, Krishnendu Chakrabarty

Root-cause analysis for integrated systems has become increasingly challenging due to their growing complexity. To tackle these challenges, machine learning (ML) has been applied to enhance root-cause analysis. Nonetheless, ML-based root-cause analysis usually requires abundant training data with root causes labeled by human experts, which are difficult or even impossible to obtain. To overcome this drawback, a semi-supervised co-training method is proposed for root-cause analysis in this paper, which requires only a small portion of labeled data. First, a random forest is trained with labeled data. Next, we propose a co-training technique that learns from unlabeled data with semi-supervised learning: it pre-labels a subset of these data automatically and then retrains each decision tree in the random forest. In addition, a robust framework is proposed to avoid over-fitting. We further apply initialization by clustering and feature selection to improve the diagnostic performance. In two case studies from industry, the proposed approach shows superior performance against other state-of-the-art methods, saving up to 67% of the labeling effort.
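
A minimal self-training-style sketch of the pre-labeling loop is shown below, using scikit-learn's RandomForestClassifier; the paper's co-training additionally retrains individual decision trees and adds a robustness framework, which this sketch omits, and the threshold and round count are assumed values.

```python
# Simplified pre-labeling loop with a random forest: confidently predicted unlabeled
# samples are pseudo-labeled and folded back into the training set, then the model is retrained.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pseudo_label_loop(X_lab, y_lab, X_unlab, rounds=5, conf_thresh=0.9):
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    X, y = np.asarray(X_lab), np.asarray(y_lab)
    X_unlab = np.asarray(X_unlab)
    for _ in range(rounds):
        model.fit(X, y)
        if len(X_unlab) == 0:
            break
        proba = model.predict_proba(X_unlab)
        confident = proba.max(axis=1) >= conf_thresh     # pick only high-confidence samples
        if not confident.any():
            break
        pseudo = model.classes_[proba[confident].argmax(axis=1)]
        X = np.vstack([X, X_unlab[confident]])
        y = np.concatenate([y, pseudo])
        X_unlab = X_unlab[~confident]
    return model
```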

Citations: 0
H3D-Transformer: A Heterogeneous 3D (H3D) Computing Platform for Transformer Model Acceleration on Edge Devices
IF 1.4, CAS Tier 4 (Computer Science), Q2 Computer Science, Pub Date: 2024-02-28, DOI: 10.1145/3649219
Yandong Luo, Shimeng Yu

Prior hardware accelerator designs primarily focused on single-chip solutions for 10 MB-class computer vision models. The GB-class transformer models for natural language processing (NLP) impose challenges on existing accelerator designs due to the massive number of parameters and the diverse matrix multiplication (MatMul) workloads involved. This work proposes a heterogeneous 3D-based accelerator design for transformer models, which adopts an interposer substrate with multiple 3D memory/logic hybrid cubes optimized for accelerating different MatMul workloads. An approximate computing scheme is proposed to take advantage of the heterogeneous computing paradigms of mixed-signal compute-in-memory (CIM) and digital tensor processing units (TPUs). From the system-level evaluation results, 10 TOPS/W energy efficiency is achieved for the BERT and GPT-2 models, which is about 2.6× to 3.1× higher than the baseline with a 7 nm TPU and stacked FeFET memory.

Citations: 0
IDeSyDe: Systematic Design Space Exploration via Design Space Identification
IF 1.4, CAS Tier 4 (Computer Science), Q2 Computer Science, Pub Date: 2024-02-10, DOI: 10.1145/3647640
Rodolfo Jordão, Matthias Becker, Ingo Sander

Design space exploration (DSE) is a key activity in embedded design processes, where a mapping between applications and platforms that meets the process design requirements must be found. Finding such mappings is very challenging due to the complexity of modern embedded platforms and applications. DSE tools aid in this challenge by potentially covering sections of the design space that could be unintuitive to designers, leading to more optimised designs. Despite this potential benefit, DSE tools remain relatively niche in the embedded industry. A significant obstacle hindering their wider adoption is integrating such tools into embedded design processes.

We present two contributions that address this integration issue. First, we present the design space identification (DSI) approach for systematically constructing DSE solutions that are modular and tuneable. Modularity means that DSE solutions can be reused to construct other DSE solutions, while tuneability means that the most specific DSE solution is chosen for the target DSE problem. Moreover, DSI enables transparent cooperation between exploration algorithms. Second, we present IDeSyDe, an extensible DSE framework for DSE solutions based on DSI. IDeSyDe allows extensions to be developed in different programming languages in a manner compliant with the DSI approach.

We showcase the relevance of these contributions through five different case studies. The case-study evaluations show that the non-exploration DSI procedures create overheads that are marginal compared to the exploration algorithms; empirically, most evaluations average 2% of the total DSE request. More importantly, the case studies show that IDeSyDe indeed provides a modular and incremental framework for constructing DSE solutions. In particular, the last case study required only minimal extensions over the previous case studies to add support for a new application type to IDeSyDe.

Citations: 0
VeriGen: A Large Language Model for Verilog Code Generation
IF 1.4, CAS Tier 4 (Computer Science), Q2 Computer Science, Pub Date: 2024-02-09, DOI: 10.1145/3643681
Shailja Thakur, Baleegh Ahmad, Hammond Pearce, Benjamin Tan, Brendan Dolan-Gavitt, Ramesh Karri, Siddharth Garg

In this study, we explore the capability of Large Language Models (LLMs) to automate hardware design by automatically completing partial Verilog code, a common language for designing and modeling digital systems. We fine-tune pre-existing LLMs on Verilog datasets compiled from GitHub and Verilog textbooks. We evaluate the functional correctness of the generated Verilog code using a specially designed test suite, featuring a custom problem set and testing benches. Here, our fine-tuned open-source CodeGen-16B model outperforms the commercial state-of-the-art GPT-3.5-turbo model with a 1.1% overall increase. Upon testing with a more diverse and complex problem set, we find that the fine-tuned model shows competitive performance against the state-of-the-art GPT-3.5-turbo, excelling in certain scenarios. Notably, it demonstrates a 41% improvement in generating syntactically correct Verilog code across various problem categories compared to its pre-trained counterpart, highlighting the potential of smaller, in-house LLMs in hardware design automation.

We release our training/evaluation scripts and LLM checkpoints as open-source contributions.
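
As a usage sketch only, the snippet below prompts a causal language model through the Hugging Face transformers API to complete a partial Verilog module; the model identifier is a placeholder, not the name of the released checkpoint, so substitute the actual identifier from the authors' released artifacts.

```python
# Hypothetical prompt-completion sketch with the transformers API; the model id below
# is a placeholder and must be replaced by the authors' released Verilog-tuned checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PLACEHOLDER/verilog-finetuned-codegen"   # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "module counter(input clk, input rst, output reg [3:0] count);\n"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```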

Citations: 0
An Efficient FPGA Architecture with Turn-Restricted Switch Boxes
IF 1.4, CAS Tier 4 (Computer Science), Q2 Computer Science, Pub Date: 2024-02-03, DOI: 10.1145/3643809
Fatemeh Serajeh Hassani, Mohammad Sadrosadati, Nezam Rohbani, Sebastian Pointner, Robert Wille, Hamid Sarbazi-azad

Field-Programmable Gate Arrays (FPGAs) employ a large number of SRAM cells to provide a flexible routing architecture, which has a significant impact on the FPGA's area and power consumption. This flexible routing allows for a rather easy realization of the desired functionality, but our evaluations show that the full routing flexibility is not required in many cases. In this work, we focus on what is actually needed and introduce a new switch-box realization, which we call Turn-Restricted Switch-Boxes, that supports only a subset of possible turns. The proposed method increases the utilization rate of FPGA switch-boxes by eliminating unused resources. Experimental evaluations confirm that the area and average power consumption can be reduced by 12.8% and 14.1% on average, respectively, and that the FPGA routing susceptibility to SEUs and MBUs can be improved by 18.2% on average, with negligible performance overhead.

Citations: 0
Reduced On-Chip Storage of Seeds for Built-In Test Generation
IF 1.4, CAS Tier 4 (Computer Science), Q2 Computer Science, Pub Date: 2024-02-01, DOI: 10.1145/3643810
Irith Pomeranz

Logic built-in self-test (LBIST) approaches use an on-chip logic block for test generation and thus enable in-field testing. Recent reports of silent data corruption underline the importance of in-field testing. In a class of storage-based LBIST approaches, compressed tests are stored on-chip and decompressed by an on-chip decompression logic. The on-chip storage requirements may become a bottleneck when the number of compressed tests is large. In this case, using each compressed test for applying several different tests allows the storage requirements to be reduced. However, producing different tests from each compressed test has a hardware overhead. This article suggests a new on-chip storage scheme for compressed tests that eliminates the additional hardware overhead. Under the new storage scheme, a set of N B-bit compressed tests targeting a set of faults F0 is translated into a sequence S of N · B bits. Every B consecutive bits of S are considered as a compressed test. The sequence S thus yields close to N · B compressed tests, magnifying the test data stored in S almost B times. Taking advantage of the extra tests, the article describes a software procedure that is applied off-line to reduce S without losing fault coverage of F0. Experimental results for benchmark circuits demonstrate significant reductions in the storage requirements of S, and significant increases in the fault coverage of a second set of faults, F1.
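
A minimal sketch of the storage scheme (assuming a bit-string representation of S, not the article's on-chip hardware) shows how reading S through a sliding B-bit window recovers close to N · B seeds from only N · B stored bits.

```python
# Sliding-window reading of the stored sequence S: every B consecutive bits form one
# compressed test (seed), so N*B stored bits yield N*B - B + 1, i.e. close to N*B, seeds.
def overlapping_seeds(S, B):
    return [S[i:i + B] for i in range(len(S) - B + 1)]

# Example: N = 3 seeds of B = 4 bits stored back to back.
S = "1011" + "0010" + "1101"        # 12 stored bits
print(overlapping_seeds(S, 4))      # 9 overlapping 4-bit seeds instead of 3
```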

Citations: 0
D3PBO: Dynamic Domain Decomposition based Parallel Bayesian Optimization for Large-scale Analog Circuit Sizing
IF 1.4, CAS Tier 4 (Computer Science), Q2 Computer Science, Pub Date: 2024-01-31, DOI: 10.1145/3643811
Aidong Zhao, Tianchen Gu, Zhaori Bi, Fan Yang, Changhao Yan, Xuan Zeng, Zixiao Lin, Wenchuang Hu, Dian Zhou

Bayesian optimization (BO) is an efficient global optimization method for expensive black-box functions. However, scaling to high-dimensional problems and large sample budgets remains a severe challenge. In order to extend BO to large-scale analog circuit synthesis, a novel, computationally efficient parallel BO method for high-dimensional problems, D3PBO, is proposed in this work. We introduce a dynamic domain decomposition method based on the maximum variance between clusters. The search space is decomposed into subdomains progressively to limit the maximal number of observations in each domain. The promising domain is explored by multi-trust-region batch BO with a local Gaussian process (GP) model. As the domain decomposition progresses, the basin-shaped domain is identified using a GP-assisted quadratic regression method and exploited by the local search method BOBYQA to achieve a faster convergence rate. The time complexity of D3PBO is constant for each iteration. Experiments demonstrate that D3PBO obtains better results with significantly less runtime than state-of-the-art methods. In the circuit optimization experiments, D3PBO achieves up to a 10× runtime speedup compared to TuRBO, with better solutions.

Citations: 0
Optimizing VLIW Instruction Scheduling via a Two-Dimensional Constrained Dynamic Programming
IF 1.4, CAS Tier 4 (Computer Science), Q2 Computer Science, Pub Date: 2024-01-25, DOI: 10.1145/3643135
Can Deng, Zhaoyun Chen, Yang Shi, Yimin Ma, Mei Wen, Lei Luo

Typical embedded processors, such as Digital Signal Processors (DSPs), usually adopt the Very Long Instruction Word (VLIW) architecture to improve computing efficiency. The performance of VLIW processors heavily relies on Instruction-Level Parallelism (ILP). Therefore, it is crucial to develop an efficient instruction scheduling algorithm that exposes more ILP. While heuristic algorithms are widely used in modern compilers due to their simple implementation and low computational cost, they have limitations in providing accurate solutions and are prone to local optima. On the other hand, exact algorithms can usually find the optimal solution, but their high time overhead makes them less suitable for large-scale problems. This paper proposes a two-dimensional constrained dynamic programming (TDCDP) approach and a quantitative model for instruction scheduling. The TDCDP approach achieves near-optimal solutions within an acceptable time overhead. Furthermore, we integrate our TDCDP approach into a mainstream compiler architecture, encompassing pre- and post-RA (register allocation) scheduling. We conduct a quantitative evaluation of TDCDP against four heuristic algorithms on a typical VLIW processor. Our approach achieves an efficiency improvement of up to 58.34% in final solutions compared to the heuristic algorithms. Additionally, post-RA scheduling yields an average speedup of 14.04% over applying pre-RA scheduling alone.
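
For contrast with the paper's two-dimensional constrained DP (whose recurrence the abstract does not give), the sketch below is a plain greedy list-scheduling baseline, one of the heuristic approaches such work compares against: it packs ready instructions into fixed-width VLIW bundles while respecting data dependencies, ignoring latencies and functional-unit constraints.

```python
# Greedy list-scheduling baseline for VLIW bundling (illustrative; not the TDCDP
# approach). deps maps an instruction id to the set of ids it depends on.
def list_schedule(instrs, deps, issue_width=4):
    remaining, done, bundles = set(instrs), set(), []
    while remaining:
        ready = sorted(i for i in remaining if deps.get(i, set()) <= done)
        if not ready:
            raise ValueError("cyclic dependencies")
        bundle = ready[:issue_width]        # fill up to the machine's issue width
        bundles.append(bundle)
        done.update(bundle)
        remaining.difference_update(bundle)
    return bundles

# Example: c depends on a and b, and d depends on c, with an issue width of 2.
print(list_schedule(["a", "b", "c", "d"], {"c": {"a", "b"}, "d": {"c"}}, issue_width=2))
# [['a', 'b'], ['c'], ['d']]
```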

Citations: 0