RGMU: A High-flexibility and Low-cost Reconfigurable Galois Field Multiplication Unit Design Approach for CGRCA
Danping Jiang, Zibin Dai, Yanjiang Liu, Zongren Zhang
ACM Transactions on Design Automation of Electronic Systems, DOI: 10.1145/3639820, published 2024-01-09

Finite field multiplication is a non-linear transformation that appears in the majority of symmetric cryptographic algorithms. Numerous specialized finite field multiplication units have been proposed as fundamental modules of the coarse-grained reconfigurable cipher logic array (CGRCA) to support more cryptographic algorithms; however, they introduce low flexibility and high overhead, reducing the performance of the array. This paper proposes a high-flexibility and low-cost reconfigurable Galois field multiplication unit, termed RGMU, to balance the trade-offs among function, delay, and area. All finite field multiplication operations, including maximum distance separable (MDS) matrix multiplication, parallel update of Fibonacci linear feedback shift registers, parallel update of Galois linear feedback shift registers, and composite field multiplication, are analyzed, and two basic operation components are abstracted from them. A reconfigurable finite field multiplication computational model is then established to demonstrate the efficacy of reconfigurable units and to guide the design of a high-performance RGMU. Finally, the overall architecture of RGMU and two multiplication circuits are introduced. Experimental results show that RGMU not only reduces hardware overhead and power consumption but also has the unique advantage of supporting all finite field multiplication operations in symmetric cryptographic algorithms.
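The Galois field arithmetic that such a unit accelerates can be illustrated with a plain bit-serial multiply. A minimal sketch, assuming GF(2^8) with the AES reduction polynomial x^8 + x^4 + x^3 + x + 1 as one illustrative configuration; the paper's RGMU is reconfigurable over such polynomials, and nothing below is taken from its circuits:

```python
def gf256_mul(a: int, b: int, poly: int = 0x11B) -> int:
    """Multiply a and b in GF(2^8) modulo an irreducible polynomial.

    0x11B encodes the AES polynomial x^8 + x^4 + x^3 + x + 1; this is an
    illustrative choice, not the only polynomial a reconfigurable unit
    would support.
    """
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a      # conditional accumulate: addition in GF(2) is XOR
        b >>= 1
        a <<= 1
        if a & 0x100:        # degree reached 8: reduce modulo the polynomial
            a ^= poly
    return result
```

The same shift-and-conditional-XOR structure underlies MDS matrix multiplication and parallel LFSR updates, which is why a single reconfigurable datapath can cover all of them.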
TROP: TRust-aware OPportunistic Routing in NoC with Hardware Trojans
Syam Sankar, Ruchika Gupta, John Jose, Sukumar Nandi
ACM Transactions on Design Automation of Electronic Systems, DOI: 10.1145/3639821, published 2024-01-08

Multiple software and hardware intellectual property (IP) components are combined on a single chip to form Multi-Processor Systems-on-Chips (MPSoCs). Because of rigid time-to-market constraints, some of these IPs come from outsourced third parties, and with the supply-chain management of IP blocks handled by potentially unreliable third-party vendors, security has grown into a crucial MPSoC design concern. These IPs may be exposed to unwanted practices such as the insertion of malicious circuits called Hardware Trojans (HTs), leading to security threats and attacks including sensitive data leakage and integrity violations. A Network-on-Chip (NoC) connects the various units of an MPSoC; since it serves as the interface between these units, it has complete access to all data flowing through the system, making NoC security a paramount design issue. Our research focuses on a threat model in which the NoC is infiltrated by multiple HTs that can corrupt packets. Data integrity is verified at the destination's network interface (NI), and a verification error triggers re-transmission of the packet. In this paper, we propose an opportunistic trust-aware routing strategy that efficiently avoids HTs while ensuring that packets arrive at their destination unaltered. Experimental results demonstrate the successful movement of packets through opportunistically selected neighbours along trust-aware paths free from HT effects. We also observe a significant reduction in packet re-transmission rate and latency at the expense of minimal area and power overhead.
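The opportunistic selection step can be sketched on a 2D mesh: prefer the most trusted neighbour among those that make progress toward the destination. The trust scores, Manhattan-distance progress rule, and tuple representation below are illustrative stand-ins, not the paper's actual TROP metrics:

```python
def pick_next_hop(neighbors, trust, dest, pos):
    """Choose the most trusted neighbour router that still makes progress
    toward the destination (Manhattan distance on a mesh).

    neighbors: list of (x, y) router coordinates adjacent to pos
    trust:     dict mapping each neighbour to a trust score in [0, 1]
    """
    def dist(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    # Productive hops strictly reduce the distance to the destination.
    candidates = [n for n in neighbors if dist(n, dest) < dist(pos, dest)]
    if not candidates:
        candidates = neighbors   # no productive hop: fall back to all neighbours
    return max(candidates, key=lambda n: trust[n])
```

In a real router this choice would be made per flit in hardware, with trust scores updated from observed integrity-check failures downstream.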
Mixed Integer Programming based Placement Refinement by RSMT Model with Movable Pins
Ke Tang, Lang Feng, Zhongfeng Wang
ACM Transactions on Design Automation of Electronic Systems, DOI: 10.1145/3639365, published 2024-01-03

Placement is a critical step in the physical design of digital application-specific integrated circuits (ASICs), as it directly affects design qualities such as wirelength and timing. For many domain-specific designs, the demand for high-performance parallel computing results in repetitive hardware instances, such as the processing elements in neural network accelerators. As these instances can dominate the area of a design, the runtime of the complete design's placement can be traded for optimizing and reusing one instance's placement to achieve higher quality. This work therefore proposes a mixed integer programming (MIP)-based placement refinement algorithm for repetitive instances. By efficiently modeling the rectilinear Steiner tree wirelength, the placement can be precisely refined for better quality. In addition, MIP formulations for timing-driven placement are proposed, and a theoretical proof is provided for the correctness of the proposed wirelength model. For instances drawn from various popular domains, experiments show that, given placements from commercial placers, the proposed algorithm can further refine the placement to reduce detailed routing wirelength by 3.76%/3.64% and critical path delay by 1.68%/2.42% under wirelength-driven/timing-driven modes, respectively, outperforming the state-of-the-art prior work.
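The wirelength objective behind such refinement is commonly approximated first by half-perimeter wirelength (HPWL), a lower bound on the rectilinear Steiner minimal tree (RSMT) length that the paper's MIP model captures more precisely. A minimal HPWL sketch, not the paper's RSMT formulation:

```python
def hpwl(pins):
    """Half-perimeter wirelength of one net: the semi-perimeter of the
    bounding box of its pin coordinates.

    pins: list of (x, y) pin positions. HPWL is exact for 2- and 3-pin
    nets and a lower bound on the RSMT length for larger nets.
    """
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))
```

The gap between HPWL and the true RSMT length on multi-pin nets is exactly what motivates the more expensive Steiner-tree model in the MIP refinement.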
Application-Level Validation of Accelerator Designs Using a Formal Software/Hardware Interface
Bo-Yuan Huang, Steven Lyubomirsky, Yi Li, Mike He, Gus Henry Smith, Thierry Tambe, Akash Gaonkar, Vishal Canumalla, Andrew Cheung, Gu-Yeon Wei, Aarti Gupta, Zachary Tatlock, Sharad Malik
ACM Transactions on Design Automation of Electronic Systems, DOI: 10.1145/3639051, published 2023-12-29

Ideally, accelerator development should be as easy as software development. Several recent design languages/tools are working toward this goal, but actually testing early designs on real applications end-to-end remains prohibitively difficult due to the cost of building specialized compiler and simulator support. We propose a new, first-in-class, mostly automated methodology termed "3LA" that enables end-to-end testing of prototype accelerator designs on unmodified source applications. A key contribution of 3LA is the use of a formal software/hardware interface that specifies an accelerator's operations and their semantics. Specifically, we leverage the Instruction-Level Abstraction (ILA), a formal specification for accelerators that has so far been used successfully for accelerator implementation verification. We show how the ILA serves as a software/hardware interface for accelerators, similar to the Instruction Set Architecture (ISA) for processors, that can be used for automated development of compilers and instruction-level simulators. Another key contribution of this work is showing how ILA-based accelerator semantics enable extending recent work on equality saturation to auto-generate basic compiler support for prototype accelerators, a technique we term "flexible matching." By combining flexible matching with simulators auto-generated from ILA specifications, our approach enables end-to-end evaluation with modest engineering effort. We detail several case studies of 3LA, which uncovered a previously unknown flaw in a recently published accelerator and facilitated its fix.
DeepFlow: A Cross-Stack Pathfinding Framework for Distributed AI Systems
Newsha Ardalani, Saptadeep Pal, Puneet Gupta
ACM Transactions on Design Automation of Electronic Systems, DOI: 10.1145/3635867, published 2023-12-21

Over the past decade, machine learning model complexity has grown at an extraordinary rate, as has the scale of the systems training such large models. However, hardware utilization in large-scale AI systems is alarmingly low (5-20%). This low system utilization is the cumulative effect of minor losses across the different layers of the stack, exacerbated by the disconnect between the engineers designing those layers, who often work in different industries. To address this challenge, we designed a cross-stack performance modeling and design space exploration framework. First, we introduce CrossFlow, a novel framework that enables cross-layer analysis all the way from the technology layer to the algorithmic layer. Next, we introduce DeepFlow (built on top of CrossFlow using machine learning techniques) to automate design space exploration and co-optimization across the layers of the stack. We have validated CrossFlow's accuracy with distributed training on real commercial hardware and showcase several DeepFlow case studies demonstrating the pitfalls of not optimizing across the technology-hardware-software stack for what is likely the most important workload driving large development investments in all aspects of the computing stack.
Enhanced Real-time Scheduling of AVB Flows in Time-Sensitive Networking
Libing Deng, Gang Zeng, Ryo Kurachi, Hiroaki Takada, Xiongren Xiao, Renfa Li, Guoqi Xie
ACM Transactions on Design Automation of Electronic Systems, DOI: 10.1145/3637878, published 2023-12-18

Time-Sensitive Networking (TSN) provides high bandwidth and time determinism for data transmission and has thus become a crucial communication technology in time-critical systems. The Gate Control List (GCL) controls the transmission of the different traffic classes in TSN: Time-Triggered (TT) flows, Audio-Video-Bridging (AVB) flows, and Best-Effort (BE) flows. Most studies focus on optimizing GCL synthesis by reserving the preceding time slots to serve TT flows with strict delay requirements, but they ignore the deadlines of non-TT flows, causing large delays. This paper therefore proposes a comprehensive scheduling method that enhances the real-time scheduling of AVB flows while guaranteeing the time determinism of TT flows. The method first optimizes GCL synthesis to reserve preceding time slots for AVB flows, and then introduces the Earliest Deadline First (EDF) policy to further improve AVB transmission by considering flow deadlines. Moreover, a worst-case delay (WCD) analysis method is proposed to verify the effectiveness of the approach. Experimental results show that the proposed method improves the transmission of AVB flows compared with state-of-the-art methods.
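The EDF step layered on top of GCL slot reservation can be sketched as a deadline-ordered transmission pass. This is a simplified single-window illustration with hypothetical frame tuples, not the paper's full GCL synthesis or WCD analysis:

```python
def edf_schedule(frames):
    """Transmit AVB frames in earliest-deadline-first order and report misses.

    frames: list of (frame_id, xmit_time, deadline), with all times relative
    to the start of one reserved window. Returns (transmission_order, misses).
    Slot accounting is collapsed to a single contiguous window for clarity.
    """
    t = 0.0
    order, misses = [], []
    for fid, c, d in sorted(frames, key=lambda f: f[2]):  # sort by deadline
        t += c                       # frame occupies the link for its xmit time
        order.append(fid)
        if t > d:                    # finished after its deadline
            misses.append(fid)
    return order, misses
```

EDF is optimal for this single-resource setting: if any ordering meets all deadlines, deadline order does, which is why considering AVB deadlines explicitly beats fixed-priority service.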
Introduction to the Special Issue on Design for Testability and Reliability of Security-aware Hardware
Tianming Ni, Xiaoqing Wen, Hussam Amrouch, Cheng Zhuo, Peilin Song
ACM Transactions on Design Automation of Electronic Systems, DOI: 10.1145/3631476, published 2023-12-18

Research on design for testability and reliability of security-aware hardware has become important in both academia and industry. With ever-growing globalization, commercial hardware design, manufacturing, transportation, and supply now involve many different countries, aggravating vulnerabilities from hardware design through manufacturing. Malicious hardware implanted during third-party manufacturing may control the operation of a circuit and tamper with its functions, causing serious security issues. Hardware, however, comprises not only devices and circuits but also systems. An important observation is that testability, reliability, and security technologies originate at different design layers, while their impact is evaluated at the system level. In other words, the testability, reliability, and security design of the different layers can be carried out holistically to optimize the whole system, and the corresponding technologies at each layer can be developed collaboratively to achieve better performance. The testability-reliability-security tradeoff has garnered attention from academia and industry, particularly in the post-Moore era, given the complexities and opportunities arising from new architectures and technologies.
RSPP: Restricted Static Pseudo-Partitioning for Mitigation of Cross-Core Covert Channel Attacks
Jaspinder Kaur, Shirshendu Das
ACM Transactions on Design Automation of Electronic Systems, DOI: 10.1145/3637222, published 2023-12-13

Cache timing channel attacks exploit inherent properties of cache memories, namely the difference between hit and miss times together with the shared nature of the cache, to leak secret information. Side channels and covert channels are the two well-known classes of cache timing channel attacks. In this paper, we propose Restricted Static Pseudo-Partitioning (RSPP), an effective partition-based mitigation mechanism that restricts the cache access of only the adversaries involved in an attack. It has an insignificant performance impact of only 1%, since benign processes retain access to the full cache and restrictions are limited to suspicious processes and cache sets. It can be implemented with a maximum storage overhead of 1.45% of the total LLC size. This paper presents three variants of the proposed mitigation mechanism, RSPP, simplified RSPP (S-RSPP), and core-wise RSPP (C-RSPP), with different hardware overheads. A full-system simulator is used to evaluate the performance impact of RSPP, and a detailed experimental analysis across different LLC and attack parameters is also discussed. RSPP is also compared with existing defense mechanisms that are effective against cross-core covert channel attacks.
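The set-restriction idea behind such partitioning can be sketched as a modified set-index function: a process flagged as suspicious is confined to a small reserved slice of the LLC sets, while benign processes keep the full index range. The parameter values and remapping rule below are illustrative assumptions, not RSPP's actual detection logic or partition sizes:

```python
def set_index(addr, n_sets=1024, line_bytes=64, suspicious=False, reserved=64):
    """Map a physical address to an LLC set index.

    Benign accesses index all n_sets as usual. An access from a process
    flagged as suspicious is folded into the last `reserved` sets, so the
    adversary can no longer observe or evict lines across the whole cache.
    """
    idx = (addr // line_bytes) % n_sets          # conventional set selection
    if suspicious:
        idx = (n_sets - reserved) + (idx % reserved)  # confine to reserved slice
    return idx
```

Because only flagged processes are remapped, the common-case hit path is untouched, which is consistent with the small (about 1%) performance impact the paper reports.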
Recently, GPU-accelerated placers such as DREAMPlace and Xplace have demonstrated their superiority over traditional CPU-reliant placers by achieving orders of magnitude speed up in placement runtime. However, due to their limited focus in placement objectives (e.g., wirelength and density), the placement quality achieved by DREAMPlace or Xplace is not comparable to that of commercial tools. In this paper, to bridge the gap between open-source and commercial placers, we present a novel placement optimization framework named GAN-Place that employs generative adversarial learning to transfer the placement quality of the industry-leading commercial placer, Synopsys ICC2, to existing open-source GPU-accelerated placers (DREAMPlace and Xplace). Without the knowledge of the underlying proprietary algorithms or constraints used by the commercial tools, our framework facilitates transfer learning to directly enhance the open-source placers by optimizing the proposed differentiable loss that denotes the “similarity” between DREAMPlace- or Xplace-generated placements and those in commercial databases. Experimental results on 7 industrial designs not only show the our GAN-Place immediately improves the Power, Performance, and Area (PPA) metrics at the placement stage, but also demonstrate that these improvements last firmly to the post-route stage, where we observe improvements by up to 8.3% in wirelength, 7.4% in power, and 37.6% in Total Negative Slack (TNS) on a commercial CPU benchmark.
{"title":"GAN-Place: Advancing Open-Source Placers to Commercial-Quality using Generative Adversarial Networks and Transfer Learning","authors":"Yi-Chen Lu, Haoxing Ren, Hao-Hsiang Hsiao, Sung Kyu Lim","doi":"10.1145/3636461","DOIUrl":"https://doi.org/10.1145/3636461","url":null,"abstract":"<p>Recently, GPU-accelerated placers such as DREAMPlace and Xplace have demonstrated their superiority over traditional CPU-reliant placers by achieving orders-of-magnitude speedups in placement runtime. However, due to their limited focus on placement objectives (e.g., wirelength and density), the placement quality achieved by DREAMPlace or Xplace is not comparable to that of commercial tools. In this paper, to bridge the gap between open-source and commercial placers, we present a novel placement optimization framework named GAN-Place that employs generative adversarial learning to transfer the placement quality of the industry-leading commercial placer, Synopsys ICC2, to existing open-source GPU-accelerated placers (DREAMPlace and Xplace). Without knowledge of the underlying proprietary algorithms or constraints used by the commercial tools, our framework facilitates transfer learning to directly enhance the open-source placers by optimizing a proposed differentiable loss that denotes the “similarity” between DREAMPlace- or Xplace-generated placements and those in commercial databases. 
Experimental results on 7 industrial designs not only show that our GAN-Place immediately improves the Power, Performance, and Area (PPA) metrics at the placement stage, but also demonstrate that these improvements persist through the post-route stage, where we observe improvements of up to 8.3% in wirelength, 7.4% in power, and 37.6% in Total Negative Slack (TNS) on a commercial CPU benchmark.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"247 1","pages":""},"PeriodicalIF":1.4,"publicationDate":"2023-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138554273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
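The adversarial "similarity" training described in the GAN-Place abstract can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the feature extractor (`density_features`), the logistic discriminator, and the non-saturating generator loss are generic GAN components, not the paper's actual differentiable loss, which the abstract does not specify.

```python
import numpy as np

def density_features(cells, bins=4):
    # Hypothetical placement feature: a flattened cell-density histogram
    # over a bins x bins grid of a unit-square placement region.
    hist, _, _ = np.histogram2d(cells[:, 0], cells[:, 1],
                                bins=bins, range=[[0, 1], [0, 1]])
    return hist.ravel() / len(cells)

def discriminator(features, w, b):
    # Logistic "realness" score: probability that a placement came from
    # the commercial tool's database rather than the open-source placer.
    return 1.0 / (1.0 + np.exp(-(features @ w + b)))

def generator_similarity_loss(open_cells, w, b):
    # Non-saturating generator loss: minimizing it pushes the open-source
    # placement toward ones the discriminator scores as "commercial".
    d = discriminator(density_features(open_cells), w, b)
    return -np.log(d + 1e-12)
```

In an actual GAN-style setup the discriminator parameters would be trained on commercial-database placements while the placer differentiates this loss with respect to cell coordinates; here both sides are frozen for clarity.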
Bo Wang, Sheng Ma, Shengbai Luo, Lizhou Wu, Jianmin Zhang, Chunyuan Zhang, Tiejun Li
Deep learning has become a highly popular research field, and deep learning algorithms previously ran primarily on CPUs and GPUs. However, with the rapid development of deep learning, it became clear that existing processors could not meet its large-scale computing requirements, and custom deep learning accelerators have become popular. The majority of primary workloads in deep learning are general matrix-matrix multiplications (GEMM), and emerging GEMMs are highly sparse and irregular. The TPU and SIGMA are representative GEMM accelerators of recent years, but the TPU does not support sparsity, and both the TPU and SIGMA suffer from insufficient Processing Element (PE) utilization. We design and implement SparGD, a sparse GEMM accelerator with dynamic dataflow. SparGD has specialized PE structures, flexible distribution and reduction networks, and a simple dataflow-switching module. When running sparse and irregular GEMMs, SparGD maintains high PE utilization while exploiting sparsity, and can switch to the optimal dataflow according to the computing environment. For sparse, irregular GEMMs, our experimental results show that SparGD outperforms systolic arrays by 30 times and SIGMA by 3.6 times.
{"title":"SparGD: A Sparse GEMM Accelerator with Dynamic Dataflow","authors":"Bo Wang, Sheng Ma, Shengbai Luo, Lizhou Wu, Jianmin Zhang, Chunyuan Zhang, Tiejun Li","doi":"10.1145/3634703","DOIUrl":"https://doi.org/10.1145/3634703","url":null,"abstract":"<p>Deep learning has become a highly popular research field, and deep learning algorithms previously ran primarily on CPUs and GPUs. However, with the rapid development of deep learning, it became clear that existing processors could not meet its large-scale computing requirements, and custom deep learning accelerators have become popular. The majority of primary workloads in deep learning are general matrix-matrix multiplications (GEMM), and emerging GEMMs are highly sparse and irregular. The TPU and SIGMA are representative GEMM accelerators of recent years, but the TPU does not support sparsity, and both the TPU and SIGMA suffer from insufficient Processing Element (PE) utilization. We design and implement SparGD, a sparse GEMM accelerator with dynamic dataflow. SparGD has specialized PE structures, flexible distribution and reduction networks, and a simple dataflow-switching module. When running sparse and irregular GEMMs, SparGD maintains high PE utilization while exploiting sparsity, and can switch to the optimal dataflow according to the computing environment. 
For sparse, irregular GEMMs, our experimental results show that SparGD outperforms systolic arrays by 30 times and SIGMA by 3.6 times.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"47 1","pages":""},"PeriodicalIF":1.4,"publicationDate":"2023-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138537677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
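The sparsity that SparGD-like accelerators exploit can be sketched in software as a GEMM that simply skips zero operands. This toy model illustrates only the skip-zero arithmetic; it is not SparGD's PE structure, distribution/reduction networks, or dataflow switching.

```python
import numpy as np

def sparse_gemm(a, b):
    # Toy sparse x dense GEMM: zero entries of A are skipped entirely,
    # mirroring how a sparsity-aware accelerator avoids spending PE
    # cycles on multiplications by zero. Illustrative only.
    m, k = a.shape
    _, n = b.shape
    c = np.zeros((m, n))
    for i in range(m):
        for j in range(k):
            if a[i, j] != 0.0:        # skip zero operands
                c[i] += a[i, j] * b[j]
    return c
```

On a highly sparse A, the inner multiply-accumulate runs only for the nonzero entries, which is where a hardware accelerator recovers the utilization a dense systolic array would waste.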