
2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD): Latest Papers

[Copyright notice]
Pub Date : 2021-11-01 DOI: 10.1109/iccad51958.2021.9643482
Citations: 0
MinSC: An Exact Synthesis-Based Method for Minimal-Area Stochastic Circuits under Relaxed Error Bound
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643580
Xuan Wang, Zhufei Chu, Weikang Qian
Stochastic computing (SC) operates on stochastic bit streams and can realize complex arithmetic functions with simple circuits. Previous work shows that introducing a small approximation error into the target function can dramatically reduce the cost of SC circuits. However, the previous heuristic method explores only a limited subset of the solution space, so the optimality of its results cannot be guaranteed. In this paper, we propose MinSC, an exact synthesis-based method for minimal-area stochastic circuits under a relaxed error bound. First, a novel search method is proposed to find the best approximation polynomial for a target function. Then, considering gates with different fan-in numbers and areas, an exact SC synthesis method using satisfiability modulo theories is designed to obtain an area-optimal SC circuit realizing the best approximation polynomial. The experimental results show that, compared with the state-of-the-art method and given an error ratio of 0.05, MinSC on average reduces the gate number, area, delay, and area-delay product of the SC circuits by 60.24%, 47.24%, 7.10%, and 57.07%, respectively.
Citations: 3
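The MinSC abstract above relies on the basic idea of stochastic computing: a value in [0, 1] is encoded as the fraction of ones in a random bit stream, so single gates act as arithmetic units. The following is a minimal sketch of that general idea only, not of the MinSC synthesis method; the stream length, seed, and function names are assumptions chosen for illustration.

```python
import random

def to_stream(p, n, rng):
    """Encode probability p as a unipolar stochastic bit stream of length n."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def value(stream):
    """Decode a stream: the fraction of ones estimates the encoded value."""
    return sum(stream) / len(stream)

def sc_multiply(a, b):
    """An AND gate multiplies two independent unipolar streams."""
    return [x & y for x, y in zip(a, b)]

def sc_scaled_add(a, b, select):
    """A 2:1 MUX driven by a p = 0.5 select stream computes (a + b) / 2."""
    return [x if s else y for x, y, s in zip(a, b, select)]

rng = random.Random(0)   # fixed seed (assumption) so the run is repeatable
N = 100_000              # stream length (assumption); longer streams reduce noise
a = to_stream(0.6, N, rng)
b = to_stream(0.5, N, rng)
s = to_stream(0.5, N, rng)

product = value(sc_multiply(a, b))        # estimates 0.6 * 0.5
average = value(sc_scaled_add(a, b, s))   # estimates (0.6 + 0.5) / 2
print(product, average)
```

The estimates carry sampling noise that shrinks as the stream gets longer, which is one reason a relaxed error bound on the target function can be a natural fit for this style of circuit.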
A Row-Based Algorithm for Non-Integer Multiple-Cell-Height Placement
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643550
Zih-Yao Lin, Yao-Wen Chang
A circuit design with non-integer multiple cell height (NIMCH) is more flexible for optimizing area, timing, and power simultaneously. A cell with a larger height provides higher pin accessibility, higher drive strength, and shorter delay. In contrast, a cell with a smaller height has a smaller area, pin capacitance, and power consumption. Such an NIMCH design must satisfy additional layout constraints that existing tool flows cannot handle well. This paper presents a row-based algorithm for non-integer multiple-cell-height placement. Our algorithm consists of two main techniques: (1) a k-means-based clustering method that assigns a height to each row to define the regions of particular cell heights, and (2) a legalization method that moves cells to satisfy NIMCH constraints. Experimental results show that our approach can significantly reduce the average routed wirelength and the average total power compared with the state-of-the-art approach.
Citations: 1
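The placement abstract above names k-means clustering as the first of its two techniques. The sketch below shows only the clustering primitive (Lloyd's algorithm in one dimension) on made-up numbers. How the paper defines its features and maps clusters to row heights is not stated in the abstract, so the sample values, the quantile-based initialization, and the function name are all assumptions.

```python
def kmeans_1d(values, k, max_iters=100):
    """Plain Lloyd's algorithm on scalars; returns the k cluster centers."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumed initialization: one center per evenly spaced quantile.
    centers = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            # Assign each value to its nearest center.
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        updated = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return centers

# Hypothetical measurements forming two groups, around 8 and around 12.
samples = [8.0, 8.0, 8.1, 7.9, 12.0, 12.1, 11.9, 12.0]
centers = kmeans_1d(samples, k=2)
print(centers)
```

In a placement flow, the resulting groups would then feed a separate legalization step; that step is not sketched here.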
MORE2: Morphable Encryption and Encoding for Secure NVM
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643529
Wei Zhao, D. Feng, Yu Hua, Wei Tong, Jingning Liu, Jie Xu, Chunyan Li, Gaoxiang Xu, Yiran Chen
Memory encryption can enhance the security of non-volatile memories (NVMs), but it significantly increases the number of data bits written to NVMs and leads to severe lifetime and performance degradation. Current encryption techniques aim to reduce the re-encryption of the many existing clean words, but they unfortunately suffer from high encryption overheads (i.e., latency and energy) and many unnecessary writes. Meanwhile, compression techniques can reduce the writes of encrypted NVM. However, we find that they may destroy the data patterns and increase the number of modified words, resulting in many encryptions in secure NVM. In this paper, we propose the MORphable Encryption and Encoding (MORE2) scheme to address these problems. Our MORphable Encryption (MORE) technique aims to reduce full-line re-encryption and avoid clean-line encryption. Besides, MORE proposes a prediction-based write scheme that avoids the encryption of clean lines and pre-encrypts the lines that are predicted to be dirty. Therefore, MORE can remove the encryption from the critical path of NVM. Furthermore, MORE2 proposes the Morphable Selective Encoding (MSE) scheme to compress the modified words while preserving clean words. MORE2 encrypts all metadata with the line counter to guarantee high security. Experimental results show that MORE2 reduces the bit flips of encrypted NVM by 53.5%, decreases the access latency by 27.32%, improves the IPC performance by 12.1%, and reduces the write energy by 29.1% compared with the state-of-the-art design.
Citations: 2
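The MORE2 abstract above starts from the observation that encryption inflates the number of bits written. A toy model makes the effect visible: with counter-based encryption, rewriting a line under a new counter changes the whole pad, so roughly half of all stored bits flip even when one plaintext bit changed. This is a sketch under loud assumptions and is not the MORE2 scheme: SHA-256 stands in for a real block cipher and is not a secure construction here, the 64-byte line and 8-byte word sizes are assumed, and the word-granular variant is hypothetical.

```python
import hashlib

LINE_BYTES = 64   # assumed memory line size
WORD_BYTES = 8    # assumed word size

def pad(key, addr, counter, length=LINE_BYTES):
    """Toy one-time pad derived from (key, address, counter). NOT secure."""
    out = b""
    block = 0
    while len(out) < length:
        msg = (key + addr.to_bytes(8, "little")
               + counter.to_bytes(8, "little") + block.to_bytes(4, "little"))
        out += hashlib.sha256(msg).digest()
        block += 1
    return out[:length]

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def bit_flips(old, new):
    """Number of bit positions that differ between two stored images."""
    return sum(bin(x ^ y).count("1") for x, y in zip(old, new))

key = b"demo-key"                      # hypothetical key
addr = 0x1000                          # hypothetical line address
old_plain = bytes(LINE_BYTES)
new_plain = b"\x01" + old_plain[1:]    # a single plaintext bit changes, in word 0

old_cipher = xor(old_plain, pad(key, addr, counter=1))
# Full-line re-encryption: the counter is bumped, so every word gets a new pad.
full_cipher = xor(new_plain, pad(key, addr, counter=2))
# Hypothetical word-granular variant: only the modified word is rewritten.
word_cipher = full_cipher[:WORD_BYTES] + old_cipher[WORD_BYTES:]

full_flips = bit_flips(old_cipher, full_cipher)
word_flips = bit_flips(old_cipher, word_cipher)
print(full_flips, word_flips)
```

A real word-granular design would also need metadata to record which words use which counter, which is part of why the abstract discusses metadata handling.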
Demystifying the Characteristics of High Bandwidth Memory for Real-Time Systems
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643473
Kazi Asifuzzaman, Mohamed Abuelala, Mohamed Hassan, F. Cazorla
The number of functionalities controlled by software on every critical real-time product is on the rise in domains like automotive, avionics, and space. To implement these advanced functionalities, software applications increasingly adopt artificial intelligence algorithms that manage massive amounts of data transmitted from various sensors. This translates into unprecedented memory performance requirements in critical systems, which the commonly used DRAM memories struggle to meet. High-Bandwidth Memory (HBM) can satisfy these requirements, offering high bandwidth, low power, and high-integration capacity features. However, it remains unclear whether the predictability and isolation properties of HBM are compatible with the requirements of critical embedded systems. In this work, we perform, to our knowledge, the first timing analysis of HBM. We show the unique structural and timing characteristics of HBM with respect to DRAM memories and how they can be exploited for better time predictability, with emphasis on increased isolation among tasks and reduced worst-case memory latency.
Citations: 2
Polyhedral-based Pipelining of Imperfectly-Nested Loop for CGRAs
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643542
Dajiang Liu, Ting Liu, Xingyu Mo, Jiaxing Shang, S. Yin
Coarse-Grained Reconfigurable Architectures (CGRAs) are promising architectures with high energy efficiency and flexibility. The computation-intensive portions of an application (e.g., loops) are often executed on CGRAs for acceleration, and modulo scheduling is commonly used for loop mapping. However, for imperfectly-nested loops, existing methods don't fully explore the structure of the loops before performing modulo scheduling, resulting in poor execution performance. To tackle this problem, we propose a polyhedral-based pipelining approach for mapping imperfectly-nested loops onto CGRAs. By efficiently exploring the transformation space for imperfectly-nested loops using the polyhedral model and taking total execution time as the optimization metric, our approach can greatly improve execution performance. On a $4 \times 4$ mesh-connected CGRA, the experimental results show that our approach can reduce the total execution time of nested loops by 50.1% on average, as compared to the state-of-the-art techniques. Moreover, the compilation time is moderate in practice.
Citations: 1
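The CGRA abstract above takes total execution time as its optimization metric for modulo-scheduled loops. A simplified textbook cost model shows why the structure of a loop nest matters: a pipelined loop of N iterations with initiation interval II and schedule depth D finishes in (N - 1) * II + D cycles, and pipelining only the innermost loop pays the fill and drain cost once per outer iteration. The sketch below is an illustration under assumptions, not the paper's model: every parameter value is hypothetical, and single-loop pipelining is just one possible transformation, since the abstract does not say which ones are explored.

```python
def res_mii(num_ops, num_pes=16):
    """Resource-constrained lower bound on II; 16 PEs matches a 4x4 array."""
    return -(-num_ops // num_pes)   # ceiling division

def pipelined_cycles(trip_count, ii, depth):
    """Cycles for a modulo-scheduled loop: one new iteration every ii cycles."""
    return (trip_count - 1) * ii + depth

def inner_only_cycles(outer_trip, inner_trip, ii, depth, outer_body_cycles):
    """Only the innermost loop is pipelined; it fills and drains every outer iteration."""
    per_outer = outer_body_cycles + pipelined_cycles(inner_trip, ii, depth)
    return outer_trip * per_outer

def single_loop_cycles(outer_trip, inner_trip, ii, depth):
    """The whole nest is pipelined as one loop of outer_trip * inner_trip iterations."""
    return pipelined_cycles(outer_trip * inner_trip, ii, depth)

# Hypothetical nest: 32 x 16 iterations, 20 operations in the inner body.
ii = res_mii(20)
baseline = inner_only_cycles(32, 16, ii, depth=10, outer_body_cycles=6)
merged = single_loop_cycles(32, 16, ii, depth=12)
print(ii, baseline, merged)
```

With these made-up numbers, the repeated fill and drain accounts for the difference between the two totals; real mappings are also constrained by recurrences and routing, which this model ignores.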
Massively Parallel Big Data Classification on a Programmable Processing In-Memory Architecture
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643480
Yeseong Kim, M. Imani, Saransh Gupta, Minxuan Zhou, T. Simunic
With the emergence of the Internet of Things, the massive data created worldwide poses huge technical challenges for efficient processing. Processing in-memory (PIM) technology has been widely investigated to overcome expensive data movements between processors and memory blocks. However, existing PIM designs incur large area overhead to enable computing capability via additional near-data processing cores and analog/mixed-signal circuits. In this paper, we propose a new massively parallel PIM architecture, called CHOIR, based on emerging nonvolatile memory technology for big data classification. Unlike existing PIM designs, which demand large analog/mixed-signal circuits, we support parallel PIM instructions for conditional and arithmetic operations in an area-efficient way. As a result, the classification solution performs both training and testing on the PIM architecture by fully utilizing the massive parallelism. Our design significantly improves the performance and energy efficiency of the classification tasks by 123× and 52×, respectively, as compared to the state-of-the-art tree boosting library running on GPU.
Citations: 0
Banshee: A Fast LLVM-Based RISC-V Binary Translator
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643546
Samuel Riedel, Fabian Schuiki, Paul Scheffler, Florian Zaruba, L. Benini
System simulators are essential for the exploration, evaluation, and verification of manycore processors and are vital for writing software and developing programming models in conjunction with architecture design. A promising approach to fast, scalable, and instruction-accurate simulation is binary translation. In this paper, we present Banshee, an instruction-accurate full-system RISC-V multi-core simulator based on LLVM-powered ahead-of-time binary translation that can simulate systems with thousands of cores. Banshee supports the RV32IMAFD instruction set. It also models peripherals, custom ISA extensions, and a multi-level, actively-managed memory hierarchy used in existing multi-cluster systems. Banshee is agnostic to the host architecture, fully open-source, and easily extensible to facilitate the exploration and evaluation of new ISA extensions. As a key novelty with respect to existing binary translation approaches, Banshee supports performance estimation through a lightweight extension, modeling the effect of architectural latencies with an average deviation of only 2 % from their actual impact. We evaluate Banshee by simulating various compute-intensive workloads on two large-scale open-source RISC-V manycore systems, Manticore and MemPool (with 4096 and 256 cores, respectively). We achieve simulation speeds of up to 618 MIPS per core or 72 GIPS for complete systems, exhibiting almost perfect scaling, competitive single-core performance, and leading multi-core performance. We demonstrate Banshee's extensibility by implementing multiple custom RISC-V ISA extensions.
Citations: 3
An Area-Efficient Scannable In Situ Timing Error Detection Technique Featuring Low Test Overhead for Resilient Circuits
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643525
Hao Zhang, Weifeng He, Yanan Sun, Mingoo Seok
Timing error detection is a key technique for resilient circuits to explore timing margins, yet it hinders scan shift operations and incurs excessive test overhead. In this paper, we propose an area-efficient scannable in situ timing error detection technique consisting of a lightweight scannable error-detection cell and propagation logic, featuring low design-for-test effort and test overhead. The proposed error-detection cell fully reuses its main and shadow latches to construct the latch-based error-detection structure in normal mode, or the flip-flop-based datapath in scan mode. Therefore, it not only offers the time-borrowing ability to lower the correction overheads, but also supports scan shift operations and detection logic tests. Besides, the dependency of error signal generation on critical path sensitization is eliminated by configuring the input and clock signals of the error propagation logic, so the detection and propagation logic can be tested easily. Benefiting from the technique, a set of test methods is presented with lower test pattern scales and test cycle overheads. Compared with previous works, the proposed cell saves at least 30.5% area overhead. Besides, experimental results across several benchmark circuits show that test patterns, static test cycles, and at-speed test cycles are reduced by 116x, 232x, and 26x on average, respectively, proving the effectiveness of the proposed technique for the design-for-test requirement.
Citations: 3
Improving Inter-kernel Data Reuse With CTA-Page Coordination in GPGPU
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643535
Xuanyi Li, Chen Li, Yang Guo, Rachata Ausavarungnirun
Although modern GPUs are equipped with growing memory capacity, accommodating the entire working set of large-scale workloads can still be a challenge. With the support of unified virtual memory and demand paging, programmers can transparently oversubscribe the main memory. However, this transparent management still comes at a severe performance cost, especially for applications with inter-kernel data sharing. While there have been many efforts to reduce the additional data migrations caused by memory oversubscription, few consider the reuse of shared data at the boundary between adjacent kernels. Because memory capacity is limited, we observe that a kernel often demands shared pages that were evicted by the previous kernel, resulting in a significant number of costly data migrations. In this paper, we propose a CTA-Page collaborative framework, called CPC, that transparently reduces the impact of memory oversubscription by coordinating CTA dispatch switching and page replacement switching to reuse inter-kernel shared data. We evaluate CPC with a variety of GPGPU benchmark suites. Experimental results show that system performance is improved by 65% compared with the state-of-the-art technique for applications with inter-kernel data sharing.
Citations: 2
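The inter-kernel reuse problem described in the CPC abstract above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the CPC framework itself: it assumes an LRU page replacement policy, two kernels that share the same pages, and hypothetical sizes (`NUM_PAGES`, `CAPACITY`). The function names are likewise made up for illustration. It compares the page faults of the second kernel when it touches the shared pages in the same order as the first kernel versus in reverse order.

```python
from collections import OrderedDict

NUM_PAGES = 8  # hypothetical number of pages shared by both kernels
CAPACITY = 5   # hypothetical device memory size in pages (oversubscribed)


def run_kernel(resident, capacity, accesses):
    """Replay page accesses against an LRU-managed memory; return the fault count."""
    faults = 0
    for page in accesses:
        if page in resident:
            resident.move_to_end(page)        # hit: mark as most recently used
        else:
            faults += 1                       # miss: page must be migrated in
            if len(resident) >= capacity:
                resident.popitem(last=False)  # evict the least recently used page
            resident[page] = True
    return faults


def second_kernel_faults(reverse_order):
    """Run kernel 1 over pages 0..N-1, then kernel 2 in the chosen order."""
    resident = OrderedDict()
    run_kernel(resident, CAPACITY, range(NUM_PAGES))  # first kernel
    if reverse_order:
        order = range(NUM_PAGES - 1, -1, -1)
    else:
        order = range(NUM_PAGES)
    return run_kernel(resident, CAPACITY, order)      # second kernel


same_order_faults = second_kernel_faults(reverse_order=False)
reversed_order_faults = second_kernel_faults(reverse_order=True)
print(same_order_faults, reversed_order_faults)
```

In this toy setup the second kernel faults on every shared page when it repeats the first kernel's order, because each page it needs was just evicted. Reversing the order lets it hit on the pages still resident from the first kernel. The abstract does not specify how CPC's dispatch and replacement switching are implemented, so the reversal here is only a stand-in for that idea.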