
2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD): Latest Papers

Simultaneous Transistor Folding and Placement in Standard Cell Layout Synthesis
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643537
Kyeonghyeon Baek, Taewhan Kim
The three major tasks in standard cell layout synthesis are transistor folding, transistor placement, and in-cell routing. They are tightly interrelated but are generally performed one at a time to reduce the extremely high complexity of the design space. In this paper, we propose an integrated approach to the two problems of transistor folding and placement. Specifically, we propose a globally optimal algorithm based on search-tree design space exploration, together with a set of effective speed-up techniques and fast cost computation based on dynamic programming. In addition, our algorithm incorporates the minimum OD (oxide diffusion) jog constraint, which depends closely on both transistor folding and placement. To our knowledge, this is the first work that tries to solve the two problems simultaneously. Experiments with the transistor netlists and design rules in the ASAP 7nm library show that the proposed method synthesizes fully routable cell layouts of minimal size within 1 second per netlist and outperforms the cell layout quality in the ASAP 7nm library; completing layouts of comparable quality manually may otherwise take several hours or days.
Citations: 1
A Row-Based Algorithm for Non-Integer Multiple-Cell-Height Placement
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643550
Zih-Yao Lin, Yao-Wen Chang
A circuit design with non-integer multiple cell height (NIMCH) is more flexible for optimizing area, timing, and power simultaneously. A cell with a larger height provides higher pin accessibility, higher drive strength, and shorter delay. In contrast, one with a smaller height has a smaller area, pin capacitance, and power consumption. Such an NIMCH design must satisfy additional layout constraints that existing tool flows cannot handle well. This paper presents a row-based algorithm for non-integer multiple-cell-height placement. Our algorithm consists of two main techniques: (1) a k-means-based clustering method that assigns a height to each row to define the regions of particular cell heights, and (2) a legalization method that moves cells to satisfy NIMCH constraints. Experimental results show that our approach can significantly reduce the average routed wirelength and the average total power compared with the state-of-the-art approach.
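The clustering step named above can be illustrated with plain one-dimensional k-means (Lloyd's algorithm) on cell y-coordinates. This is only a sketch of the generic building block: the function name `kmeans_1d`, the initialization, and the idea of clustering on y-coordinates are assumptions for illustration, and how the paper maps clusters to row heights is not stated in the abstract.

```python
def kmeans_1d(values, k, iters=50):
    """Lloyd's algorithm on scalars; returns (centers, labels).

    Hypothetical helper for illustration, not the paper's method.
    Assumes 1 <= k <= number of distinct values.
    """
    pts = sorted(set(values))
    # Spread the initial centers across the sorted distinct values.
    centers = [pts[(i * (len(pts) - 1)) // max(k - 1, 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each value goes to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, labels


# Made-up y-coordinates of cells that prefer two different row heights.
centers, labels = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 10.0], k=2)
```

In this toy input the two clusters separate cleanly into a low band and a high band, which is the kind of grouping a row-height assignment could start from.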
Citations: 1
MORE2: Morphable Encryption and Encoding for Secure NVM
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643529
Wei Zhao, D. Feng, Yu Hua, Wei Tong, Jingning Liu, Jie Xu, Chunyan Li, Gaoxiang Xu, Yiran Chen
Memory encryption can enhance the security of non-volatile memories (NVMs), but it significantly increases the data bits written to NVMs and leads to severe lifetime and performance degradation. Current encryption techniques aim to reduce the re-encryption of existing clean words, but they suffer from high encryption overheads (i.e., latency and energy) and many unnecessary writes. Meanwhile, compression techniques can reduce the writes of encrypted NVM. However, we find that they may destroy the data patterns and increase the number of modified words, resulting in many encryptions in secure NVM. In this paper, we propose the MORphable Encryption and Encoding (MORE2) scheme to address these problems. Our MORphable Encryption (MORE) technique aims to reduce full-line re-encryption and avoid clean-line encryption. In addition, MORE proposes a prediction-based write scheme that avoids the encryption of clean lines and pre-encrypts the lines that are predicted to be dirty. Therefore, MORE can remove encryption from the critical path of NVM. Furthermore, MORE2 proposes the Morphable Selective Encoding (MSE) scheme to compress the modified words while preserving clean words. MORE2 encrypts all metadata with the line counter to guarantee high security. Experimental results show that MORE2 reduces the bit flips of encrypted NVM by 53.5%, decreases the access latency by 27.32%, improves the IPC performance by 12.1%, and reduces the write energy by 29.1% compared with the state-of-the-art design.
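The claim that encryption inflates the number of bits written can be seen in a small sketch. It assumes counter-mode encryption, where a pad derived from a per-line counter is XORed with the data; the abstract mentions a line counter but does not spell out the cipher. The pad below comes from `hashlib.shake_256` purely as a stand-in and is not real memory encryption; the names `pad`, `encrypt`, and `bit_flips` are hypothetical.

```python
import hashlib


def pad(counter, nbytes):
    """Stand-in for a counter-mode one-time pad (NOT real encryption)."""
    return hashlib.shake_256(counter.to_bytes(8, "little")).digest(nbytes)


def encrypt(line, counter):
    """XOR the plaintext line with the pad for this counter value."""
    return bytes(p ^ k for p, k in zip(line, pad(counter, len(line))))


def bit_flips(old, new):
    """Number of cells that change when `new` overwrites `old`."""
    return sum(bin(a ^ b).count("1") for a, b in zip(old, new))


old_line = bytes(64)               # a 64-byte line of zeros
new_line = bytes([1]) + bytes(63)  # one bit modified

plain_flips = bit_flips(old_line, new_line)
# A fresh counter on every write changes the whole pad, so roughly half
# of the 512 bits flip even though only one plaintext bit changed.
cipher_flips = bit_flips(encrypt(old_line, 1), encrypt(new_line, 2))
```

The gap between `plain_flips` and `cipher_flips` is the overhead that schemes avoiding the re-encryption of clean words try to reduce.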
Citations: 2
MinSC: An Exact Synthesis-Based Method for Minimal-Area Stochastic Circuits under Relaxed Error Bound
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643580
Xuan Wang, Zhufei Chu, Weikang Qian
Stochastic computing (SC) operates on stochastic bit streams and can realize complex arithmetic functions with simple circuits. A previous work shows that introducing a small approximation error for the target function can dramatically reduce the cost of SC circuits. However, the previous heuristic method only explores a limited subset of the solution space, so the optimality of the results cannot be guaranteed. In this paper, we propose MinSC, an exact synthesis-based method for minimal-area stochastic circuits under a relaxed error bound. First, a novel search method is proposed to find the best approximation polynomial for a target function. Then, considering gates with different fan-in numbers and areas, an exact SC synthesis method using satisfiability modulo theories is designed to obtain an area-optimal SC circuit realizing the best approximation polynomial. The experimental results show that, compared with the state-of-the-art method and given an error ratio of 0.05, MinSC on average reduces the gate number, area, delay, and area-delay product of the SC circuits by 60.24%, 47.24%, 7.10%, and 57.07%, respectively.
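As a reminder of why SC circuits can be so small, the following sketch shows the textbook case of multiplication with a single AND gate on two independent unipolar bit streams. It illustrates stochastic computing in general, not MinSC's synthesis flow; the helper names and the stream length are assumptions chosen for the example.

```python
import random


def to_stream(p, n, rng):
    """Encode probability p as n random bits, each 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]


def stream_value(bits):
    """Decode a unipolar stream: the fraction of ones."""
    return sum(bits) / len(bits)


def sc_multiply(a, b):
    """Bitwise AND of two independent streams estimates the product."""
    return [x & y for x, y in zip(a, b)]


rng = random.Random(0)  # fixed seed so the run is repeatable
a = to_stream(0.5, 20000, rng)
b = to_stream(0.4, 20000, rng)
estimate = stream_value(sc_multiply(a, b))  # close to 0.5 * 0.4 = 0.2
```

The estimate is only approximate and tightens as the streams get longer, which is the accuracy-versus-cost trade-off that an error-bounded synthesis method exploits.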
Citations: 3
Demystifying the Characteristics of High Bandwidth Memory for Real-Time Systems
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643473
Kazi Asifuzzaman, Mohamed Abuelala, Mohamed Hassan, F. Cazorla
The number of functionalities controlled by software on every critical real-time product is on the rise in domains like automotive, avionics, and space. To implement these advanced functionalities, software applications increasingly adopt artificial intelligence algorithms that manage massive amounts of data transmitted from various sensors. This translates into unprecedented memory performance requirements in critical systems that the commonly used DRAM memories struggle to meet. High-Bandwidth Memory (HBM) can satisfy these requirements, offering high bandwidth, low power, and high-integration capacity features. However, it remains unclear whether the predictability and isolation properties of HBM are compatible with the requirements of critical embedded systems. In this work, we perform, to our knowledge, the first timing analysis of HBM. We show the unique structural and timing characteristics of HBM with respect to DRAM memories and how they can be exploited for better time predictability, with emphasis on increased isolation among tasks and reduced worst-case memory latency.
Citations: 2
Polyhedral-based Pipelining of Imperfectly-Nested Loop for CGRAs
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643542
Dajiang Liu, Ting Liu, Xingyu Mo, Jiaxing Shang, S. Yin
Coarse-Grained Reconfigurable Architectures (CGRAs) are promising architectures with high energy efficiency and flexibility. The computation-intensive portions of an application (e.g., loops) are often executed on CGRAs for acceleration, and modulo scheduling is commonly used for loop mapping. However, for imperfectly-nested loops, existing methods don't fully explore the structure of the loops before performing modulo scheduling, resulting in poor execution performance. To tackle this problem, we propose a polyhedral-based pipelining approach for mapping imperfectly-nested loops on CGRAs. By efficiently exploring the transformation space for imperfectly-nested loops using the polyhedral model and taking total execution time as the optimization metric, our approach can improve the execution performance greatly. On a $4\times 4$ mesh-connected CGRA, the experimental results show that our approach can reduce the total execution time of nested loops by 50.1% on average compared with the state-of-the-art techniques. Moreover, the compilation time is moderate in practice.
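Why total execution time, rather than the initiation interval alone, is a sensible metric can be sketched with the standard cycle count of a modulo-scheduled loop: (trip count - 1) x II + schedule depth. The two strategies and all numbers below are hypothetical and only illustrate the trade-off; they are not taken from the paper.

```python
def pipelined_cycles(trip_count, ii, depth):
    """Cycles for a modulo-scheduled loop: a new iteration starts every
    `ii` cycles and each iteration takes `depth` cycles to finish."""
    if trip_count <= 0:
        return 0
    return (trip_count - 1) * ii + depth


def inner_only(outer, inner, ii, depth, outer_overhead):
    """Pipeline only the innermost loop: the pipeline fills and drains
    once per outer iteration, plus the outer-loop code runs each time."""
    return outer * (pipelined_cycles(inner, ii, depth) + outer_overhead)


def flattened(outer, inner, ii, depth):
    """Pipeline the whole nest as one loop of outer * inner iterations."""
    return pipelined_cycles(outer * inner, ii, depth)


# Hypothetical nest: 64 outer iterations, 8 inner iterations each.
a = inner_only(outer=64, inner=8, ii=1, depth=10, outer_overhead=4)
b = flattened(outer=64, inner=8, ii=2, depth=12)
```

With these made-up numbers the transformed nest finishes sooner even though its initiation interval is twice as large, because it pays the fill and drain cost only once.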
Citations: 1
Peripheral Circuitry Assisted Mapping Framework for Resistive Logic-In-Memory Computing
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643588
Shuhang Zhang, Hai Helen Li, Ulf Schlichtmann
In-memory computing has been applied in different fields due to its superior speed and energy efficiency. Among a variety of memory technologies that have been explored, resistive memory has widely been adopted for various purposes, including Processing-In-Memory (PIM) for neural networks and Logic-In-Memory (LIM) for general logic operations. PIM has intensively been studied in recent years, while the progress in developing LIM computing falls behind. LIM computing is usually implemented based on MAGIC operations, which require inputs to be aligned regularly along rows or columns in a memory crossbar. As the intermediate data generated during the logic execution are normally scattered across the memory crossbar, alignment operations are inserted to align the data, which often costs numerous cycles and dominates the overall latency. In current MAGIC-based designs, alignment operations induce a significant overhead in either area or latency. Therefore, the Area-Latency-Product (ALP), known as a key metric for circuit performance, still has significant optimization potential in LIM computing. In this work, we leverage peripheral circuitry to conduct alignment operations and propose a novel mapping framework to optimize the latency and area costs. Intermediate data are read out, processed in peripheral circuits, then in parallel written back into target cells of the memory crossbar. The approach eliminates the use of redundant memory cells, leading to area reduction. Moreover, it enables simultaneous alignments of multiple intermediate data, which can decrease the overall latency significantly. Based on simulation results, our proposed mapping framework can achieve around 93% ALP reductions on average compared with prior designs with merely 2.13% total area overhead.
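To make the cost of multi-step logic concrete, the sketch below builds XOR from NOR gates only. It assumes, as is commonly described for MAGIC, that NOR is the basic in-memory gate; the abstract itself does not name the gate. Each NOR output would occupy a memory cell and each step would cost at least one cycle, and the alignment overhead discussed above is not modeled. The function names are hypothetical.

```python
def nor(*bits):
    """NOR of one or more bits; with a single input it acts as NOT."""
    return 0 if any(bits) else 1


def xor_from_nor(a, b):
    """XOR built from five NOR gates; returns (result, gate_count)."""
    n1 = nor(a, b)     # 1 only when a = b = 0
    n2 = nor(a, n1)    # (NOT a) AND b
    n3 = nor(b, n1)    # a AND (NOT b)
    n4 = nor(n2, n3)   # XNOR of a and b
    return nor(n4), 5  # final inversion gives XOR


truth_table = {(a, b): xor_from_nor(a, b)[0] for a in (0, 1) for b in (0, 1)}
```

Even this two-input function needs several intermediate results, which hints at why placing and aligning intermediate data can dominate latency and area in larger circuits.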
Citations: 2
FedSwap: A Federated Learning based 5G Decentralized Dynamic Spectrum Access System
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643496
Zhihui Gao, Ang Li, Yunfan Gao, Bing Li, Yu Wang, Yiran Chen
The era of 5G extends the available spectrum from the microwave band to the millimeter-wave band. The thriving Internet of Things (IoT) also enriches the user equipment (UEs) we use in daily life, such as smart glasses, smart watches, and drones. With such a large spectrum and massive numbers of UEs, existing dynamic spectrum access (DSA) suffers from both low spectrum utilization efficiency and unfair spectrum allocation. Thus, a more sophisticated DSA system is required in the 5G context. In this paper, we propose a federated learning based system, FedSwap, the first decentralized DSA system that improves both efficiency and fairness simultaneously. In FedSwap, we deploy an improved multi-agent reinforcement learning (iMARL) algorithm on each UE, enabling UEs to share the spectrum in a coordinated manner with fewer collisions. Furthermore, we propose a novel swapping mechanism for aggregating UEs' models periodically so that UEs can share the spectrum resources fairly. Meanwhile, the sensory data of UEs are not transmitted, and hence privacy is protected. We evaluate FedSwap's performance in 5G simulations with various settings. Compared to the state-of-the-art decentralized DSA methods, FedSwap can significantly improve the efficiency and fairness of spectrum utilization.
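Fairness of a spectrum allocation is often quantified with Jain's fairness index, which equals 1 when every user gets the same share and 1/n when one user gets everything. The abstract does not say which fairness metric FedSwap reports, so using this index here is an assumption, and the throughput values are made up.

```python
def jain_index(throughputs):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in [1/n, 1]."""
    total = sum(throughputs)
    squares = sum(x * x for x in throughputs)
    if squares == 0:
        return 1.0  # nobody receives anything: trivially equal shares
    return total * total / (len(throughputs) * squares)


even = jain_index([1, 1, 1, 1])    # perfectly fair allocation
skewed = jain_index([4, 0, 0, 0])  # one UE holds the whole spectrum
```

A decentralized access scheme that lets a few UEs monopolize channels would push this index toward 1/n, while a fair one keeps it near 1.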
Citations: 2
Improving Inter-kernel Data Reuse With CTA-Page Coordination in GPGPU
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643535
Xuanyi Li, Chen Li, Yang Guo, Rachata Ausavarungnirun
Although modern GPUs are equipped with expanding memory, accommodating the entire working set of large-scale workloads can still be a challenge. With the support of unified virtual memory and demand paging, programmers can transparently oversubscribe the main memory. However, this transparent management still comes at a severe performance cost, especially for applications with inter-kernel data sharing. While there have been many efforts to reduce the additional data migrations caused by memory oversubscription, few consider the reuse of shared data at the boundary between adjacent kernels. Because memory capacity is limited, we observe that a kernel often demands shared pages that were evicted by the previous kernel, resulting in a significant number of costly data migrations. In this paper, we propose a CTA-Page collaborative framework, called CPC, that transparently reduces the impact of memory oversubscription by coordinating CTA dispatch switching and page replacement switching to reuse inter-kernel shared data. We evaluate CPC with a variety of GPGPU benchmark suites. Experimental results show that system performance is improved by 65% compared with the state-of-the-art technique for applications with inter-kernel data sharing.
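The observation in this abstract, that a kernel re-demands shared pages its predecessor just evicted, can be illustrated with a toy model. The sketch below is not the CPC mechanism and is not taken from the paper; it only assumes plain LRU replacement and a memory smaller than the shared working set, and all names (`run_kernel`, `second_kernel_faults`) are hypothetical. It compares the page faults of a second kernel that touches the shared pages in the same order as the first with one that touches them in reverse order.

```python
from collections import OrderedDict


def run_kernel(resident, capacity, accesses):
    """Touch pages in order under LRU replacement; return the page-fault count."""
    faults = 0
    for page in accesses:
        if page in resident:
            resident.move_to_end(page)  # hit: mark as most recently used
        else:
            faults += 1  # miss: the page would have to be migrated in
            if len(resident) >= capacity:
                resident.popitem(last=False)  # evict the least recently used page
            resident[page] = True
    return faults


def second_kernel_faults(num_pages, capacity, reverse_second):
    """Faults seen by a second kernel that shares every page with the first.

    Assumption (illustrative only): both kernels touch each shared page once,
    and memory holds only `capacity` of the `num_pages` pages.
    """
    resident = OrderedDict()
    pages = list(range(num_pages))
    run_kernel(resident, capacity, pages)  # first kernel warms and overflows memory
    order = pages[::-1] if reverse_second else pages
    return run_kernel(resident, capacity, order)


if __name__ == "__main__":
    print(second_kernel_faults(8, 4, reverse_second=False))
    print(second_kernel_faults(8, 4, reverse_second=True))
```

In this toy setting, the same-order second kernel misses on every page, because each page it needs was evicted just before it is requested, while the reversed order first reuses the pages that are still resident. How CPC actually switches CTA dispatch and page replacement is described in the paper itself.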
Citations: 2
iSTELLAR: intermittent Signature aTtenuation Embedded CRYPTO with Low-Level metAl Routing
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643540
Jeremy Blackstone, D. Das, Alric Althoff, Shreyas Sen, R. Kastner
An adversary can exploit side-channel information such as power consumption, electromagnetic (EM) emanations, acoustic vibrations, or the timing of encryption operations to derive the secret key from an electronic device. Signature aTtenuation Embedded CRYPTO with Low-Level metAl Routing (STELLAR) is a technique to mitigate power- and EM-based attacks; however, it incurs a 50% power overhead. This work presents iSTELLAR, which reduces the power overhead by operating STELLAR intermittently, using an intelligent scheduling algorithm. The proposed scheduling algorithm for iSTELLAR determines the optimal locations during the crypto operation to turn STELLAR ON, and thereby reduces the power overhead by >30% compared to normal STELLAR operation, while eliminating the information leakage.
Citations: 0