Pub Date: 2021-11-01
DOI: 10.1109/ICCAD51958.2021.9643580
Xuan Wang, Zhufei Chu, Weikang Qian
Stochastic computing (SC) operates on stochastic bit streams and can realize complex arithmetic functions with simple circuits. Previous work shows that introducing a small approximation error into the target function can dramatically reduce the cost of SC circuits. However, that heuristic method explores only a limited subset of the solution space, so the optimality of its results cannot be guaranteed. In this paper, we propose MinSC, an exact synthesis-based method for minimal-area stochastic circuits under a relaxed error bound. First, a novel search method is proposed to find the best approximation polynomial for a target function. Then, considering gates with different fanin numbers and areas, an exact SC synthesis method using satisfiability modulo theories is designed to obtain an area-optimal SC circuit realizing the best approximation polynomial. The experimental results show that, compared with the state-of-the-art method and given an error ratio of 0.05, MinSC on average reduces the gate number, area, delay, and area-delay product of the SC circuits by 60.24%, 47.24%, 7.10%, and 57.07%, respectively.
Title: MinSC: An Exact Synthesis-Based Method for Minimal-Area Stochastic Circuits under Relaxed Error Bound
Venue: 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)
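The MinSC abstract above rests on the observation that stochastic bit streams allow arithmetic with very simple gates. The following is a minimal sketch of that principle only, not of MinSC itself: assuming the common unipolar encoding (a stream's value is the fraction of 1s) and independent input streams, a single AND gate approximates multiplication. The helper names `to_stream`, `stream_value`, and `sc_multiply` are made up for this illustration.

```python
import random


def to_stream(p, length, rng):
    """Encode probability p as a random bit stream (unipolar encoding)."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def stream_value(stream):
    """Decode a stream: the value is the fraction of 1s."""
    return sum(stream) / len(stream)


def sc_multiply(a, b):
    """A bitwise AND of two independent streams approximates their product."""
    return [x & y for x, y in zip(a, b)]


rng = random.Random(0)  # fixed seed so the run is repeatable
length = 4096
a = to_stream(0.5, length, rng)
b = to_stream(0.25, length, rng)
product = stream_value(sc_multiply(a, b))
print(f"estimated 0.5 * 0.25 = {product:.3f} (exact value 0.125)")
```

The estimate carries sampling noise that shrinks as the stream gets longer, which is the accuracy-versus-cost trade-off that makes a relaxed error bound attractive.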
Pub Date: 2021-11-01
DOI: 10.1109/ICCAD51958.2021.9643550
Zih-Yao Lin, Yao-Wen Chang
A circuit design with non-integer multiple cell heights (NIMCH) is more flexible for optimizing area, timing, and power simultaneously. A cell with a larger height provides higher pin accessibility, higher drive strength, and shorter delay. In contrast, one with a smaller height has a smaller area, pin capacitance, and power consumption. Such an NIMCH design must satisfy additional layout constraints that existing tool flows cannot handle well. This paper presents a row-based algorithm for non-integer multiple-cell-height placement. Our algorithm consists of two main techniques: (1) a k-means-based clustering method that assigns a height to each row, defining the regions of particular cell heights, and (2) a legalization method that moves cells to satisfy the NIMCH constraints. Experimental results show that our approach can significantly reduce the average routed wirelength and the average total power compared with the state-of-the-art approach.
Title: A Row-Based Algorithm for Non-Integer Multiple-Cell-Height Placement
Venue: 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)
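The abstract above does not spell out how the k-means-based clustering works, so the following is only a rough sketch under stated assumptions: cells are clustered by their vertical position with a plain 1-D k-means (Lloyd's algorithm), and each resulting region takes the most common cell height among its members. The cell data, the two height classes (8 and 12 tracks, a 1.5 ratio), and the function names are hypothetical and not taken from the paper.

```python
from collections import Counter


def nearest(value, centers):
    """Index of the center closest to value."""
    return min(range(len(centers)), key=lambda i: abs(value - centers[i]))


def kmeans_1d(values, centers, max_iters=20):
    """Plain Lloyd's algorithm on scalar values with given initial centers."""
    centers = list(centers)
    for _ in range(max_iters):
        groups = [[] for _ in centers]
        for v in values:
            groups[nearest(v, centers)].append(v)
        # Keep the old center if a group happens to be empty.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return centers


# Hypothetical cells: (y position from global placement, cell height in tracks).
cells = [(1.0, 8), (1.5, 8), (2.0, 8), (2.2, 12),
         (9.0, 12), (9.5, 12), (10.0, 12), (10.5, 8)]

centers = kmeans_1d([y for y, _ in cells], centers=[0.0, 12.0])

# Each region adopts the majority height of the cells clustered into it.
region_height = {}
for idx in range(len(centers)):
    heights = [h for y, h in cells if nearest(y, centers) == idx]
    if heights:
        region_height[idx] = Counter(heights).most_common(1)[0][0]

print(centers, region_height)
```

In this toy input, one 12-track cell falls in the 8-track region and one 8-track cell falls in the 12-track region. Moving such cells into a region of matching height is the job the abstract assigns to the legalization step.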
Pub Date: 2021-11-01
DOI: 10.1109/ICCAD51958.2021.9643529
Wei Zhao, D. Feng, Yu Hua, Wei Tong, Jingning Liu, Jie Xu, Chunyan Li, Gaoxiang Xu, Yiran Chen
Memory encryption can enhance the security of non-volatile memories (NVMs), but it significantly increases the number of data bits written to NVMs and leads to severe lifetime and performance degradation. Current encryption techniques aim to reduce the re-encryption of the many existing clean words, but they suffer from high encryption overheads (i.e., latency and energy) and many unnecessary writes. Meanwhile, compression techniques can reduce the writes of encrypted NVM. However, we find that they may destroy the data patterns and increase the number of modified words, resulting in many encryptions in secure NVM. In this paper, we propose the MORphable Encryption and Encoding (MORE2) scheme to address these problems. Our MORphable Encryption (MORE) technique aims to reduce full-line re-encryption and avoid clean-line encryption. In addition, MORE proposes a prediction-based write scheme that avoids the encryption of clean lines and pre-encrypts the lines that are predicted to be dirty. Therefore, MORE can remove the encryption from the critical path of NVM. Furthermore, MORE2 proposes the Morphable Selective Encoding (MSE) scheme to compress the modified words while preserving clean words. MORE2 encrypts all metadata with the line counter to guarantee high security. Experimental results show that MORE2 reduces the bit flips of encrypted NVM by 53.5%, decreases the access latency by 27.32%, improves the IPC performance by 12.1%, and reduces the write energy by 29.1% compared with the state-of-the-art design.
Title: MORE2: Morphable Encryption and Encoding for Secure NVM
Venue: 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)
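To make the write-amplification problem in the abstract above concrete, here is a small sketch. It assumes counter-mode encryption, where each line is XORed with a pad derived from a key, the line address, and a per-line counter. The abstract mentions a line counter but does not describe the cipher, so this is an assumption. SHA-512 stands in for the real pad generator purely for illustration, and the key, address, and function names are hypothetical.

```python
import hashlib

LINE_BYTES = 64  # one 512-bit memory line


def make_pad(key, address, counter):
    """Stand-in pad generator: SHA-512 yields exactly 64 bytes per line."""
    msg = key + address.to_bytes(8, "little") + counter.to_bytes(8, "little")
    return hashlib.sha512(msg).digest()


def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def bit_flips(old, new):
    """Number of cell bits that change when `new` overwrites `old`."""
    return sum(bin(x ^ y).count("1") for x, y in zip(old, new))


key = b"demo-key"      # hypothetical key
address = 0x1000       # hypothetical line address

old_plain = bytes(LINE_BYTES)
new_plain = bytes([0xFF]) + old_plain[1:]  # only the first byte is modified

# Unencrypted write: only the modified byte flips.
plain_flips = bit_flips(old_plain, new_plain)

# Encrypted write: the counter increments, so the whole pad changes.
old_cipher = xor_bytes(old_plain, make_pad(key, address, 1))
new_cipher = xor_bytes(new_plain, make_pad(key, address, 2))
cipher_flips = bit_flips(old_cipher, new_cipher)

print(f"plaintext flips: {plain_flips}, ciphertext flips: {cipher_flips} of 512 bits")
```

Under these assumptions, a one-byte update costs 8 bit flips in plaintext but roughly half of the 512 bits after full-line re-encryption. This is the motivation for leaving clean words untouched.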
Pub Date: 2021-11-01
DOI: 10.1109/ICCAD51958.2021.9643473
Kazi Asifuzzaman, Mohamed Abuelala, Mohamed Hassan, F. Cazorla
The number of functionalities controlled by software on every critical real-time product is on the rise in domains like automotive, avionics, and space. To implement these advanced functionalities, software applications increasingly adopt artificial intelligence algorithms that manage massive amounts of data transmitted from various sensors. This translates into unprecedented memory performance requirements in critical systems that the commonly used DRAM memories struggle to provide. High-Bandwidth Memory (HBM) can satisfy these requirements, offering high bandwidth, low power, and high integration capacity. However, it remains unclear whether the predictability and isolation properties of HBM are compatible with the requirements of critical embedded systems. In this work, we perform, to our knowledge, the first timing analysis of HBM. We show the unique structural and timing characteristics of HBM with respect to DRAM memories and how they can be exploited for better time predictability, with emphasis on increased isolation among tasks and reduced worst-case memory latency.
Title: Demystifying the Characteristics of High Bandwidth Memory for Real-Time Systems
Venue: 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)
Pub Date: 2021-11-01
DOI: 10.1109/ICCAD51958.2021.9643542
Dajiang Liu, Ting Liu, Xingyu Mo, Jiaxing Shang, S. Yin
Coarse-Grained Reconfigurable Architectures (CGRAs) are promising architectures with high energy efficiency and flexibility. The computation-intensive portions of an application (e.g., loops) are often executed on CGRAs for acceleration, and modulo scheduling is commonly used for loop mapping. However, for imperfectly-nested loops, existing methods do not fully explore the structure of the loops before performing modulo scheduling, resulting in poor execution performance. To tackle this problem, we propose a polyhedral-based pipelining approach for mapping imperfectly-nested loops onto CGRAs. By efficiently exploring the transformation space for imperfectly-nested loops using the polyhedral model and taking total execution time as the optimization metric, our approach can greatly improve execution performance. On a $4\times 4$ mesh-connected CGRA, the experimental results show that our approach reduces the total execution time of nested loops by 50.1% on average compared with the state-of-the-art techniques. Moreover, the compilation time is moderate in practice.
Title: Polyhedral-based Pipelining of Imperfectly-Nested Loop for CGRAs
Venue: 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)
Pub Date: 2021-11-01
DOI: 10.1109/ICCAD51958.2021.9643480
Yeseong Kim, M. Imani, Saransh Gupta, Minxuan Zhou, T. Simunic
With the emergence of the Internet of Things, the massive data created in the world pose huge technical challenges for efficient processing. Processing in-memory (PIM) technology has been widely investigated to overcome expensive data movements between processors and memory blocks. However, existing PIM designs incur large area overhead to enable computing capability via additional near-data processing cores and analog/mixed-signal circuits. In this paper, we propose a new massively parallel PIM architecture, called CHOIR, based on emerging nonvolatile memory technology for big data classification. Unlike existing PIM designs, which demand large analog/mixed-signal circuits, we support parallel PIM instructions for conditional and arithmetic operations in an area-efficient way. As a result, the classification solution performs both training and testing on the PIM architecture by fully utilizing the massive parallelism. Our design improves the performance and energy efficiency of the classification tasks by 123× and 52×, respectively, compared with the state-of-the-art tree boosting library running on a GPU.
Title: Massively Parallel Big Data Classification on a Programmable Processing In-Memory Architecture
Venue: 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)
Pub Date: 2021-11-01
DOI: 10.1109/ICCAD51958.2021.9643546
Samuel Riedel, Fabian Schuiki, Paul Scheffler, Florian Zaruba, L. Benini
System simulators are essential for the exploration, evaluation, and verification of manycore processors and are vital for writing software and developing programming models in conjunction with architecture design. A promising approach to fast, scalable, and instruction-accurate simulation is binary translation. In this paper, we present Banshee, an instruction-accurate full-system RISC-V multi-core simulator based on LLVM-powered ahead-of-time binary translation that can simulate systems with thousands of cores. Banshee supports the RV32IMAFD instruction set. It also models peripherals, custom ISA extensions, and the multi-level, actively managed memory hierarchy used in existing multi-cluster systems. Banshee is agnostic to the host architecture, fully open-source, and easily extensible to facilitate the exploration and evaluation of new ISA extensions. As a key novelty with respect to existing binary translation approaches, Banshee supports performance estimation through a lightweight extension, modeling the effect of architectural latencies with an average deviation of only 2% from their actual impact. We evaluate Banshee by simulating various compute-intensive workloads on two large-scale open-source RISC-V manycore systems, Manticore and MemPool (with 4096 and 256 cores, respectively). We achieve simulation speeds of up to 618 MIPS per core or 72 GIPS for complete systems, exhibiting almost perfect scaling, competitive single-core performance, and leading multi-core performance. We demonstrate Banshee's extensibility by implementing multiple custom RISC-V ISA extensions.
Title: Banshee: A Fast LLVM-Based RISC-V Binary Translator
Venue: 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)
Pub Date: 2021-11-01
DOI: 10.1109/ICCAD51958.2021.9643525
Hao Zhang, Weifeng He, Yanan Sun, Mingoo Seok
Timing error detection is a key technique for resilient circuits to explore timing margins, yet it hinders scan shift operations and incurs excessive test overhead. In this paper, we propose an area-efficient scannable in situ timing error detection technique consisting of a lightweight scannable error-detection cell and propagation logic, featuring low design-for-test effort and test overhead. The proposed error-detection cell fully reuses its main and shadow latches to construct the latch-based error-detection structure in normal mode, or the flip-flop-based datapath in scan mode. Therefore, it not only offers the time-borrowing ability to lower the correction overheads, but also supports scan shift operations and detection logic tests. In addition, the dependency of error signal generation on critical path sensitization is eliminated by configuring the input and clock signals of the error propagation logic, so the detection and propagation logic can be tested easily. Building on the technique, a set of test methods is presented with smaller test pattern counts and test cycle overheads. Compared with previous works, the proposed cell saves at least 30.5% area overhead. Furthermore, experimental results across several benchmark circuits show that the technique reduces test patterns by 116×, static test cycles by 232×, and at-speed test cycles by 26× on average, demonstrating its effectiveness for design-for-test requirements.
Title: An Area-Efficient Scannable In Situ Timing Error Detection Technique Featuring Low Test Overhead for Resilient Circuits
Venue: 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)
Pub Date: 2021-11-01
DOI: 10.1109/ICCAD51958.2021.9643535
Xuanyi Li, Chen Li, Yang Guo, Rachata Ausavarungnirun
Although modern GPUs are equipped with expanding memory, accommodating the entire working set of large-scale workloads can still be a challenge. With the support of unified virtual memory and demand paging, programmers can transparently oversubscribe the main memory. However, this transparent management still comes at a severe performance cost, especially for applications with inter-kernel data sharing. While there have been many efforts to reduce the additional data migrations caused by memory oversubscription, few consider the reuse of shared data at the boundary between adjacent kernels. We observe that, because of limited memory capacity, a kernel often demands shared pages that were evicted by the previous kernel, resulting in a significant number of costly data migrations. In this paper, we propose a CTA-Page collaborative framework, called CPC, that transparently reduces the impact of memory oversubscription by coordinating CTA dispatch switching and page replacement switching to reuse inter-kernel shared data. We evaluate CPC with a variety of GPGPU benchmark suites. Experimental results show that the system performance is improved by 65% compared with the state-of-the-art technique for applications with inter-kernel data sharing.
Title: Improving Inter-kernel Data Reuse With CTA-Page Coordination in GPGPU
Venue: 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)
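The eviction problem described in the abstract above can be reproduced with a toy model. The sketch below is not CPC. It assumes a plain LRU replacement policy, a device memory of 6 pages, and two back-to-back kernels that both touch the same 8 shared pages. All sizes and names are hypothetical. It only illustrates why the order in which the second kernel touches shared pages matters under oversubscription.

```python
from collections import OrderedDict

CAPACITY = 6    # pages that fit in device memory (hypothetical)
NUM_PAGES = 8   # shared working set, larger than CAPACITY


def run_kernel(resident, capacity, accesses):
    """Replay page accesses against an LRU-managed memory; return page faults."""
    faults = 0
    for page in accesses:
        if page in resident:
            resident.move_to_end(page)        # mark as most recently used
        else:
            faults += 1                       # page must be migrated in
            if len(resident) >= capacity:
                resident.popitem(last=False)  # evict least recently used page
            resident[page] = True
    return faults


def simulate(second_kernel_order):
    """Run two kernels back to back on the same memory; return their fault counts."""
    resident = OrderedDict()
    first = run_kernel(resident, CAPACITY, range(NUM_PAGES))
    second = run_kernel(resident, CAPACITY, second_kernel_order)
    return first, second


same_order = simulate(range(NUM_PAGES))
reversed_order = simulate(reversed(range(NUM_PAGES)))
print("same order:", same_order, "reversed order:", reversed_order)
```

With the same access order, every page the second kernel needs has just been evicted, so all 8 accesses fault. Starting from the pages the first kernel touched last leaves only 2 faults. Coordinating dispatch order with replacement is the kind of reuse the abstract targets, although the actual CPC mechanisms are not detailed there.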