Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643537
Kyeonghyeon Baek, Taewhan Kim
The three major tasks in standard cell layout synthesis are transistor folding, transistor placement, and in-cell routing, which are tightly interrelated but are generally performed one at a time to reduce the extremely high complexity of the design space. In this paper, we propose an integrated approach to the two problems of transistor folding and placement. Specifically, we propose a globally optimal algorithm based on search-tree design space exploration, together with a set of effective speed-up techniques and fast, dynamic-programming-based cost computation. In addition, our algorithm incorporates the minimum OD (oxide diffusion) jog constraint, which depends closely on both transistor folding and placement. To our knowledge, this is the first work that attempts to solve the two problems simultaneously. Experiments with the transistor netlists and design rules of the ASAP 7nm library show that our method synthesizes fully routable cell layouts of minimal size within 1 second per netlist, outperforming the layout quality of the cells in the ASAP 7nm library; completing layouts of comparable quality manually may otherwise take several hours or days.
Title : Simultaneous Transistor Folding and Placement in Standard Cell Layout Synthesis
Venue : 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)
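As a rough illustration of what transistor folding means in the abstract above, the sketch below splits one wide transistor into several narrower fingers. It is not the authors' search-tree algorithm: it picks only the single folding with the fewest fingers, whereas the paper explores folding choices jointly with placement. The function name `fold` and the parameters `width_fins` and `max_fins_per_finger` are hypothetical, and measuring width in fins is an assumption.

```python
import math


def fold(width_fins: int, max_fins_per_finger: int) -> list[int]:
    """Split a transistor into the fewest fingers that respect the width limit.

    Hypothetical helper: widths are counted in fins, and the returned list
    holds the width of each finger, as evenly balanced as possible.
    """
    if width_fins <= 0 or max_fins_per_finger <= 0:
        raise ValueError("widths must be positive")
    # Fewest fingers such that no finger exceeds the limit.
    fingers = math.ceil(width_fins / max_fins_per_finger)
    # Distribute the fins evenly; the first `extra` fingers get one more fin.
    base, extra = divmod(width_fins, fingers)
    return [base + 1] * extra + [base] * (fingers - extra)


if __name__ == "__main__":
    print(fold(7, 3))  # a 7-fin transistor with at most 3 fins per finger
```

Each additional finger adds a gate column to the cell, which is why the folding choice interacts with placement and cell width.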
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643550
Zih-Yao Lin, Yao-Wen Chang
A circuit design with non-integer multiple cell height (NIMCH) is more flexible for optimizing area, timing, and power simultaneously. A cell with a larger height provides higher pin accessibility, higher drive strength, and shorter delay. In contrast, one with a smaller height has a smaller area, pin capacitance, and power consumption. Such an NIMCH design must satisfy additional layout constraints that existing tool flows cannot handle well. This paper presents a row-based algorithm for non-integer multiple-cell-height placement. Our algorithm consists of two main techniques: (1) a k-means-based clustering method that assigns a height to each row to define the regions of particular cell heights, and (2) a legalization method that moves cells to satisfy the NIMCH constraints. Experimental results show that our approach can significantly reduce the average routed wirelength and the average total power compared with the state-of-the-art approach.
Title : A Row-Based Algorithm for Non-Integer Multiple-Cell-Height Placement
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643529
Wei Zhao, D. Feng, Yu Hua, Wei Tong, Jingning Liu, Jie Xu, Chunyan Li, Gaoxiang Xu, Yiran Chen
Memory encryption can enhance the security of non-volatile memories (NVMs), but it significantly increases the number of data bits written to NVMs and leads to severe lifetime and performance degradation. Current encryption techniques aim to reduce the re-encryption of existing clean words, but they suffer from high encryption overheads (i.e., latency and energy) and many unnecessary writes. Compression techniques, meanwhile, can reduce the writes of encrypted NVM. However, we find that they may destroy the data patterns and increase the number of modified words, resulting in many encryptions in secure NVM. In this paper, we propose the MORphable Encryption and Encoding (MORE2) scheme to address these problems. Our MORphable Encryption (MORE) technique aims to reduce full-line re-encryption and avoid clean-line encryption. In addition, MORE uses a prediction-based write scheme that avoids encrypting clean lines and pre-encrypts the lines predicted to be dirty. MORE can therefore remove encryption from the critical path of NVM. Furthermore, MORE2 proposes the Morphable Selective Encoding (MSE) scheme to compress the modified words while preserving clean words. MORE2 encrypts all metadata with the line counter to guarantee high security. Experimental results show that MORE2 reduces the bit flips of encrypted NVM by 53.5%, decreases the access latency by 27.32%, improves the IPC performance by 12.1%, and reduces the write energy by 29.1% compared with the state-of-the-art design.
Title : MORE2: Morphable Encryption and Encoding for Secure NVM
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643580
Xuan Wang, Zhufei Chu, Weikang Qian
Stochastic computing (SC) operates on stochastic bit streams, which can realize complex arithmetic functions with simple circuits. A previous work shows that by introducing a small approximation error for the target function, the cost of SC circuits can be dramatically reduced. However, the previous heuristic method only explores a limited subset of the solution space, so the optimality of the results cannot be guaranteed. In this paper, we propose MinSC, an exact synthesis-based method for minimal-area stochastic circuits under a relaxed error bound. First, a novel search method is proposed to find the best approximation polynomial for a target function. Then, considering gates with different fanin numbers and areas, an exact SC synthesis method using satisfiability modulo theories is designed to obtain an area-optimal SC circuit realizing the best approximation polynomial. The experimental results show that, compared with the state-of-the-art method and given an error ratio of 0.05, MinSC on average reduces the gate number, area, delay, and area-delay product of the SC circuits by 60.24%, 47.24%, 7.10%, and 57.07%, respectively.
Title : MinSC: An Exact Synthesis-Based Method for Minimal-Area Stochastic Circuits under Relaxed Error Bound
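To make concrete the claim that stochastic bit streams realize arithmetic with simple circuits, the sketch below shows the textbook case: in the unipolar encoding, a single AND gate multiplies the values carried by two independent streams. This is a minimal illustration of SC in general, not the MinSC synthesis flow; the helper names `to_stream`, `value`, and `sc_multiply`, the stream length, and the seed are all assumptions chosen for the example.

```python
import random


def to_stream(p: float, length: int, rng: random.Random) -> list[int]:
    """Encode a value p in [0, 1] as a bit stream where each bit is 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def value(stream: list[int]) -> float:
    """Decode a unipolar stream: the value is the fraction of 1s."""
    return sum(stream) / len(stream)


def sc_multiply(a: list[int], b: list[int]) -> list[int]:
    """Bitwise AND of two independent streams; the output stream encodes the product."""
    return [x & y for x, y in zip(a, b)]


rng = random.Random(0)          # fixed seed so the run is repeatable
a = to_stream(0.5, 10_000, rng)
b = to_stream(0.4, 10_000, rng)
estimate = value(sc_multiply(a, b))

if __name__ == "__main__":
    print(f"estimated product: {estimate:.3f} (exact: 0.200)")
```

The estimate is only approximate and its accuracy depends on the stream length, which is the kind of error tolerance that a relaxed error bound can trade against circuit area.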
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643473
Kazi Asifuzzaman, Mohamed Abuelala, Mohamed Hassan, F. Cazorla
The number of functionalities controlled by software on every critical real-time product is on the rise in domains such as automotive, avionics, and space. To implement these advanced functionalities, software applications increasingly adopt artificial intelligence algorithms that manage massive amounts of data transmitted from various sensors. This translates into unprecedented memory performance requirements in critical systems, which the commonly used DRAM memories struggle to meet. High-Bandwidth Memory (HBM) can satisfy these requirements, offering high bandwidth, low power, and high integration capacity. However, it remains unclear whether the predictability and isolation properties of HBM are compatible with the requirements of critical embedded systems. In this work, we perform, to our knowledge, the first timing analysis of HBM. We show the unique structural and timing characteristics of HBM with respect to DRAM memories and how they can be exploited for better time predictability, with emphasis on increased isolation among tasks and reduced worst-case memory latency.
Title : Demystifying the Characteristics of High Bandwidth Memory for Real-Time Systems
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643542
Dajiang Liu, Ting Liu, Xingyu Mo, Jiaxing Shang, S. Yin
Coarse-Grained Reconfigurable Architectures (CGRAs) are promising architectures with high energy efficiency and flexibility. The computation-intensive portions of an application (e.g., loops) are often executed on CGRAs for acceleration, and modulo scheduling is commonly used for loop mapping. However, for imperfectly-nested loops, existing methods do not fully explore the structure of the loops before performing modulo scheduling, resulting in poor execution performance. To tackle this problem, we propose a polyhedral-based pipelining approach for mapping imperfectly-nested loops on CGRAs. By efficiently exploring the transformation space for imperfectly-nested loops using the polyhedral model and taking total execution time as the optimization metric, our approach can greatly improve execution performance. On a $4\times 4$ mesh-connected CGRA, the experimental results show that our approach reduces the total execution time of nested loops by 50.1% on average compared with the state-of-the-art techniques. Moreover, the compilation time is moderate in practice.
Title : Polyhedral-based Pipelining of Imperfectly-Nested Loop for CGRAs
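For readers unfamiliar with modulo scheduling, the following sketch computes two standard quantities behind the "total execution time" metric mentioned above: the resource-constrained lower bound on the initiation interval (II) and the cycle count of a software-pipelined loop. It is a simplified model that ignores recurrence constraints and routing, and it is not the polyhedral method of the paper. The function names and the example numbers (40 operations, 100 iterations, schedule length 10) are hypothetical; only the 16 processing elements follow from the $4\times 4$ array.

```python
import math


def res_mii(num_ops: int, num_pes: int) -> int:
    """Resource-constrained minimum II: each PE executes at most one operation per cycle."""
    return math.ceil(num_ops / num_pes)


def pipelined_cycles(trip_count: int, ii: int, schedule_length: int) -> int:
    """Cycles for a pipelined loop: a new iteration starts every `ii` cycles,
    and the last iteration needs `schedule_length` cycles to drain."""
    return (trip_count - 1) * ii + schedule_length


if __name__ == "__main__":
    ii = res_mii(num_ops=40, num_pes=16)   # 16 PEs on a 4x4 array
    print(ii, pipelined_cycles(trip_count=100, ii=ii, schedule_length=10))
```

Because the trip count multiplies the II, transformations that lower the II of the dominant loop tend to matter more than those that shorten a single iteration.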
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643588
Shuhang Zhang, Hai Helen Li, Ulf Schlichtmann
In-memory computing has been applied in different fields due to its superior speed and energy efficiency. Among the variety of memory technologies that have been explored, resistive memory has been widely adopted for various purposes, including Processing-In-Memory (PIM) for neural networks and Logic-In-Memory (LIM) for general logic operations. PIM has been studied intensively in recent years, while progress in LIM computing has lagged behind. LIM computing is usually implemented based on MAGIC operations, which require inputs to be aligned regularly along rows or columns in a memory crossbar. As the intermediate data generated during logic execution are normally scattered across the memory crossbar, alignment operations are inserted to align the data, which often costs numerous cycles and dominates the overall latency. In current MAGIC-based designs, alignment operations induce a significant overhead in either area or latency. Therefore, the Area-Latency-Product (ALP), a key metric for circuit performance, still has significant optimization potential in LIM computing. In this work, we leverage peripheral circuitry to conduct alignment operations and propose a novel mapping framework to optimize the latency and area costs. Intermediate data are read out, processed in peripheral circuits, and then written back in parallel into the target cells of the memory crossbar. The approach eliminates the use of redundant memory cells, leading to area reduction. Moreover, it enables simultaneous alignment of multiple intermediate data, which can decrease the overall latency significantly. Based on simulation results, our proposed mapping framework achieves an ALP reduction of around 93% on average compared with prior designs, with a total area overhead of merely 2.13%.
Title : Peripheral Circuitry Assisted Mapping Framework for Resistive Logic-In-Memory Computing
The era of 5G extends the available spectrum from the microwave band to the millimeter-wave band. The thriving Internet of Things (IoT) also enriches the user equipment (UEs) we use in daily life, such as smart glasses, smart watches, and drones. With such a large spectrum and massive numbers of UEs, existing dynamic spectrum access (DSA) suffers from both low spectrum utilization efficiency and unfair spectrum allocation. Thus, a more sophisticated DSA system is required in the 5G context. In this paper, we propose a federated learning based system, FedSwap, the first decentralized DSA system that improves both efficiency and fairness simultaneously. In FedSwap, we deploy an improved multi-agent reinforcement learning (iMARL) algorithm on each UE, enabling UEs to share the spectrum in a coordinated manner with fewer collisions. Furthermore, we propose a novel swapping mechanism for aggregating the UEs' models periodically so that UEs can share the spectrum resources fairly. Meanwhile, the sensory data of the UEs are not transmitted, and hence privacy is protected. We evaluate FedSwap's performance in 5G simulations with various settings. Compared with the state-of-the-art decentralized DSA methods, FedSwap can significantly improve the efficiency and fairness of spectrum utilization.
Title : FedSwap: A Federated Learning based 5G Decentralized Dynamic Spectrum Access System
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643496
Zhihui Gao, Ang Li, Yunfan Gao, Bing Li, Yu Wang, Yiran Chen
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643535
Xuanyi Li, Chen Li, Yang Guo, Rachata Ausavarungnirun
Although modern GPUs are equipped with expanding memory, accommodating the entire working set of large-scale workloads can still be a challenge. With the support of unified virtual memory and demand paging, programmers can transparently oversubscribe the main memory. However, this transparent management still comes at a severe performance cost, especially for applications with inter-kernel data sharing. While there have been many efforts to reduce the additional data migrations caused by memory oversubscription, few consider the reuse of shared data at the boundary between adjacent kernels. We observe that, due to limited memory capacity, a kernel often demands shared pages that were evicted by the previous kernel, resulting in a significant number of costly data migrations. In this paper, we propose a CTA-Page collaborative framework, called CPC, that transparently reduces the impact of memory oversubscription by coordinating CTA dispatch switching and page replacement switching to reuse inter-kernel shared data. We evaluate CPC with a variety of GPGPU benchmark suites. Experimental results show that system performance is improved by 65% compared with the state-of-the-art technique for applications with inter-kernel data sharing.
Title : Improving Inter-kernel Data Reuse With CTA-Page Coordination in GPGPU
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643540
Jeremy Blackstone, D. Das, Alric Althoff, Shreyas Sen, R. Kastner
An adversary can exploit side-channel information such as power consumption, electromagnetic (EM) emanations, acoustic vibrations, or the timing of encryption operations to derive the secret key from an electronic device. Signature aTtenuation Embedded CRYPTO with Low-Level metAl Routing (STELLAR) is a technique to mitigate power and EM-based attacks; however, it incurs a 50% power overhead. This work presents iSTELLAR, which reduces the power overhead by operating STELLAR intermittently using an intelligent scheduling algorithm. The proposed scheduling algorithm for iSTELLAR determines the optimal locations during the crypto operation at which to turn STELLAR on, and thereby reduces the power overhead by $> 30\%$ compared with normal STELLAR operation, while eliminating the information leakage.
Title : iSTELLAR: intermittent Signature aTtenuation Embedded CRYPTO with Low-Level metAl Routing