Modern automotive systems require increased performance to implement Advanced Driving Assistance Systems (ADAS). GPU-powered platforms are promising candidates for such computational tasks; however, current low-level programming models complicate the accelerator software certification process and limit the hardware selection to a fraction of the available platforms. In this paper we present Brook Auto, a high-level programming language for automotive GPU systems which removes these limitations. We describe the challenges we faced in its implementation and our solutions, as well as a complete evaluation in terms of performance and productivity, which shows the effectiveness of our method.
{"title":"Brook Auto: High-Level Certification-Friendly Programming for GPU-powered Automotive Systems","authors":"Matina Maria Trompouki, Leonidas Kosmidis","doi":"10.1145/3195970.3196002","DOIUrl":"https://doi.org/10.1145/3195970.3196002","url":null,"abstract":"Modern automotive systems require increased performance to implement Advanced Driving Assistance Systems (ADAS). GPU-powered platforms are promising candidates for such computational tasks, however current low-level programming models challenge the accelerator software certification process, while they limit the hardware selection to a fraction of the available platforms. In this paper we present Brook Auto, a high-level programming language for automotive GPU systems which removes these limitations. We describe the challenges and solutions we faced in its implementation, as well as a complete evaluation in terms of performance and productivity, which shows the effectiveness of our method.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"17 3 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76990721","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance failure has become a growing concern for the robustness and reliability of memory circuits. It is challenging to accurately estimate the extremely small failure probability when failed samples are distributed in multiple disjoint failure regions. In this paper, we develop an adaptive importance sampling (AIS) method. AIS performs several iterations of sampling-region adjustment, whereas existing methods pre-decide a static sampling distribution. By iteratively searching for failure regions, AIS may lead to better efficiency and accuracy. This is validated by our experiments. For an SRAM cell with a single failure region, AIS uses 5–10X fewer samples and reaches better accuracy compared to several recent methods. For a sense amplifier circuit with multiple failure regions, AIS is 4369X faster than Monte Carlo (MC) without compromising accuracy, while other methods fail to cover all failure regions in our experiment.
{"title":"A Fast and Robust Failure Analysis of Memory Circuits Using Adaptive Importance Sampling Method","authors":"Xiao Shi, Jun Yang, Fengyuan Liu, Lei He","doi":"10.1145/3195970.3195972","DOIUrl":"https://doi.org/10.1145/3195970.3195972","url":null,"abstract":"Performance failure has become a growing concern for the robustness and reliability of memory circuits. It is challenging to accurately estimate the extremely small failure probability when failed samples are distributed in multiple disjoint failure regions. In this paper, we develop an adaptive importance sampling (AlS) method. AIS has several iterations of sampling region adjusbnents, while existing methods pre-decide a static sampling distribution. By iteratively searching for failure regions, AIS may lead to better efficiency and accuracy. This is validated by our experiments. For SRAM cell with single failure region, AIS uses 5–10X fewer samples and reaches better accuracy when compared to several recent methods. For sense amplifier circuit with multiple failure regions, AIS is 4369X faster than MC without compromising accuracy, while other methods fail to cover all failure regions in our experiment","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"97 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76188637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the trend toward specialization, an efficient memory-path design is vital to capitalize on customization in the data-path. A monolithic memory hierarchy is often highly inefficient for irregular applications, traditionally targeted at CPUs. New approaches and tools are required to offer application-specific memory customization that combines the benefits of cache and scratchpad memory simultaneously. This paper introduces a novel approach for automated application-specific on-chip memory assignment and tiling. The approach offers two major tools: (1) static memory access analysis and (2) variable-level memory assignment. Static memory analysis is performed at the LLVM abstraction level; it extracts target-independent pointer behaviors, measures access strides, and analyzes the prefetchability of variables. Variable-level memory assignment creates a memory allocation graph for memory assignment (cache vs. scratchpad) based on the variables' sizes and their estimated locality; it also explores opportunities for tiling memory accesses. For the exploration and results, this paper uses the MachSuite benchmarks (with both regular and irregular memory access behaviors) and the gem5-Aladdin tool for performance and power evaluation. The proposed approach optimizes the memory hierarchy by automatically combining the benefits of cache and (tiled) scratchpad at variable-level granularity per individual application. The results demonstrate more than a 45% improvement in the power-stall product, on average, over a monolithic cache or scratchpad design.
{"title":"Locality Aware Memory Assignment and Tiling","authors":"Samuel Rogers, H. Tabkhi","doi":"10.1145/3195970.3196070","DOIUrl":"https://doi.org/10.1145/3195970.3196070","url":null,"abstract":"With the trend toward specialization, an efficient memory-path design is vital to capitalize customization in data-path. A monolithic memory hierarchy is often highly inefficient for irregular applications, traditionally targeted for CPUs. New approaches and tools are required to offer application-specific memory customization combining the benefits of cache and scratchpad memory simultaneously.This paper introduces a novel approach for automated application-specific on-chip memory assignment and tiling. The approach offers two major tools: (1) static memory access analysis and (2) variable-level memory assignment. Static memory analysis performs at the LLVM abstraction. It extracts target-independent pointer behaviors, measures the access strides and analyze the prefetchability of variables. (2) variable-level memory assignment creates a memory allocation graph for memory assignment (cache vs. scratchpad) based on the variables size and their estimated locality. It also explores the opportunity for tiling memory access. For the exploration and results, this paper uses Machsuite benchmarks (with both regular & irregular memory access behaviors), and gem5-Aladdin tool for performance & power evaluation. The proposed approach optimizes the memory hierarchy by automatically combining the benefits of cache, (tiled-) scratchpad at variable level granularity per individual applications. The results demonstrate more than 45% improvement in our power-stall product, on average, over the monolithic cache or scratchpad design.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"1 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88374071","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data stored in an off-chip memory, such as DRAM or non-volatile main memory, can potentially be extracted or tampered with by an attacker with physical access to a device. Protecting against such attacks requires storing message authentication codes (MACs) and counters, which incur a 22% storage overhead. In this work, we propose techniques for reducing these overheads. We first present a scheme that leverages ECC DRAMs to reduce MAC verification and storage overheads. We replace the parity bits in standard ECC with a combination of MAC and parity bits to provide both authentication and error correction. This eliminates the extra MAC storage and minimizes the verification overhead, as MACs can be read in parallel with data through the ECC bus. Next, we use efficient integer encodings to reduce counter storage overhead by 6x while enhancing application performance.
{"title":"Reducing the Overhead of Authenticated Memory Encryption Using Delta Encoding and ECC Memory","authors":"Salessawi Ferede Yitbarek, T. Austin","doi":"10.1145/3195970.3196102","DOIUrl":"https://doi.org/10.1145/3195970.3196102","url":null,"abstract":"Data stored in an off-chip memory, such as DRAM or non-volatile main memory, can potentially be extracted or tampered by an attacker with physical access to a device. Protecting such attacks requires storing message authentication codes and counters - which incur a 22% storage overhead. In this work, we propose techniques for reducing these overheads.We first present a scheme that leverages ECC DRAMs to reduce MAC verification & storage overheads. We replace the parity bits in standard ECC by a combination of MAC and parity bits to provide both authentication and error correction. This eliminates the extra MAC storage and minimizes the verification overhead as MACs can be read in parallel with data through the ECC bus. Next, we use efficient integer encodings to reduce counter storage overhead by 6 × while enhancing application performance.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"74 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80564750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Energy-efficient microprocessors are essential for a wide range of applications. While near-threshold computing is a promising technique for improving energy efficiency, the optimal supply demands of the logic core and the on-chip memory conflict. In this paper, we perform a reliability analysis of 6T SRAM and discover imbalanced minimum voltage requirements between read and write operations. We leverage this imbalance in near-threshold processors equipped with voltage-boosting capability by proposing an opportunistic dual-supply switching scheme with a write aggregation buffer. Our results show that the proposed technique improves energy efficiency by more than 18% with an approximately 8.54% performance speed-up.
{"title":"SRAM based Opportunistic Energy Efficiency Improvement in Dual-Supply Near-Threshold Processors","authors":"Yunfei Gu, Dengxue Yan, Vaibhav Verma, M. Stan, Xuan Zhang","doi":"10.1145/3195970.3196121","DOIUrl":"https://doi.org/10.1145/3195970.3196121","url":null,"abstract":"Energy-efficient microprocessors are essential for a wide range of applications. While near-threshold computing is a promising technique to improve energy efficiency, optimal supply demands from logic core and on-chip memory are confiicting. In this paper, we perform reliability analysis of 6T SRAM and discover imbalanced minimum voltage requirements between read and write operations. We leverage this imbalance property in near-threshold processors equipped with voltage boosting capability by proposing an opportunistic dual-supply switching scheme with a write aggregation buffer. Our results show that proposed technique improves energy efficiency by more than 18% with approximate 8.54% performance speed-up.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"26 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83689313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To store the desired data in MLC and TLC flash memories, conventional programming strategies need to divide a fixed threshold-voltage (Vt) window into several parts. The narrowly partitioned Vt window in turn limits the design of the programming strategy and becomes the main cause of flash-memory defects, i.e., longer read/write latency and worse data reliability. This motivates us to explore an innovative programming design that addresses these flash-memory defects. To achieve defect-free 3D NAND flash memory, this paper presents and realizes a one-shot program design that significantly eliminates the negative impacts caused by conventional programming strategies. The proposed one-shot program design includes two strategies, i.e., prophetic and classification programming, for MLC flash memories, and the idea is extended to TLC flash memories. The measurement results show that it can accelerate programming speed by 31x and reduce the RBER by 1000x for MLC flash memory, and it can broaden the available threshold-voltage window by up to 5.1x for TLC flash memory.
{"title":"Achieving Defect-Free Multilevel 3D Flash Memories with One-Shot Program Design","authors":"Chien-Chung Ho, Yung-Chun Li, Yuan-Hao Chang, Yu-Ming Chang","doi":"10.1145/3195970.3195982","DOIUrl":"https://doi.org/10.1145/3195970.3195982","url":null,"abstract":"To store the desired data on MLC and TLC flash memories, the conventional programming strategies need to divide a fixed range of threshold voltage (Vt) window into several parts. The narrowly partitioned Vt window in turn limits the design of programming strategy and becomes the main reason to cause flash-memory defects, i.e., the longer read/write latency and worse data reliability. This motivates this work to explore the innovative programming design for solving the flash-memory defects. Thus, to achieve the defect-free 3D NAND flash memory, this paper presents and realizes a one-shot program design to significantly eliminate the negative impacts caused by conventional programming strategies. The proposed one-shot program design includes two strategies, i.e., prophetic and classification programming, for MLC flash memories, and the idea is extended to TLC flash memories. The measurement results show that it can accelerate programming speed by 31x and reduce RBER by 1000x for the MLC flash memory, and it can broaden the available window of threshold voltage up to 5.1x for the TLC flash memory.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"1 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83551396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Engineering Change Orders (ECOs) modify a synthesized netlist after its specification has changed. An ECO is divided into two major tasks: finding the target signals whose functions should be updated, and synthesizing the patch that produces the desired change. This paper proposes an efficient SAT-based solution for the second task: resource-aware computation of multi-output patch functions. The solution is based on several new algorithms and outperforms the top three winners of the 2017 ICCAD CAD Contest (Problem A).
{"title":"Efficient Computation of ECO Patch Functions","authors":"A. Dao, Nian-Ze Lee, Li-Cheng Chen, Mark Po-Hung Lin, J. H. Jiang, A. Mishchenko, R. Brayton","doi":"10.1145/3195970.3196039","DOIUrl":"https://doi.org/10.1145/3195970.3196039","url":null,"abstract":"Engineering Change Orders (ECO) modify a synthesized netlist after its specification has changed. ECO is divided into two major tasks: finding target signals whose functions should be updated and synthesizing the patch that produces the desired change. This paper proposes an efficient SAT-based solution for the second task: resource-aware computation of multi-output patch functions. The solution is based on several new algorithms and outperforms the top three winners of the 2017 ICCAD CAD Contest (Problem A).","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"47 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90706485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A DRAM chip requires periodic refresh operations to prevent data loss due to charge leakage in DRAM cells. Refresh operations incur significant performance overhead, as a DRAM bank/rank becomes unavailable to service access requests while being refreshed. In this work, our goal is to reduce the performance overhead of DRAM refresh by reducing the latency of a refresh operation. We observe that a significant number of DRAM cells can retain their data for longer than the worst-case refresh period of 64 ms. Such cells do not always need to be fully refreshed; a low-latency partial refresh is sufficient for them. We propose Variable Refresh Latency DRAM (VRL-DRAM), a mechanism that fully refreshes a DRAM cell only when necessary and otherwise ensures data integrity by issuing low-latency partial refresh operations. We develop a new, detailed analytical model to estimate the minimum latency of a refresh operation that ensures the data integrity of a cell with a given retention-time profile. We evaluate VRL-DRAM with memory traces from real workloads and show that it reduces the average refresh performance overhead by 34% compared to the state-of-the-art approach.
{"title":"VRL-DRAM: Improving DRAM Performance via Variable Refresh Latency","authors":"Anup Das, Hasan Hassan, O. Mutlu","doi":"10.1145/3195970.3196136","DOIUrl":"https://doi.org/10.1145/3195970.3196136","url":null,"abstract":"A DRAM chip requires periodic refresh operations to prevent data loss due to charge leakage in DRAM cells. Refresh operations incur significant performance overhead as a DRAM bank/rank becomes unavailable to service access requests while being refreshed. In this work, our goal is to reduce the performance overhead of DRAM refresh by reducing the latency of a refresh operation. We observe that a significant number of DRAM cells can retain their data for longer than the worst-case refresh period of 64ms. Such cells do not always need to be fully refreshed; a low-latency partial refresh is sufficient for them.We propose Variable Refresh Latency DRAM (VRL-DRAM), a mechanism that fully refreshes a DRAM cell only when necessary, and otherwise ensures data integrity by issuing low-latency partial refresh operations. We develop a new detailed analytical model to estimate the minimum latency of a refresh operation that ensures data integrity of a cell with a given retention time profile. We evaluate VRL-DRAM with memory traces from real workloads, and show that it reduces the average refresh performance overhead by 34% compared to the state-of-the-art approach.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"37 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89955223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The emergence of IoT poses new challenges for authenticating numerous, highly heterogeneous IoT devices to their respective trust domains. Using passwords or pre-defined keys has drawbacks that limit their use in IoT scenarios. Recent works propose using contextual information about the ambient physical properties of devices' surroundings as a shared secret to mutually authenticate devices that are co-located, e.g., in the same room. In this paper, we analyze these context-based authentication solutions with regard to their security and their requirements on context quality. We quantify their achievable security based on empirical real-world data from context measurements in typical IoT environments.
{"title":"Revisiting Context-Based Authentication in IoT","authors":"Markus Miettinen, T. D. Nguyen, A. Sadeghi, N. Asokan","doi":"10.1145/3195970.3196106","DOIUrl":"https://doi.org/10.1145/3195970.3196106","url":null,"abstract":"The emergence of IoT poses new challenges towards solutions for authenticating numerous very heterogeneous IoT devices to their respective trust domains. Using passwords or pre-defined keys have drawbacks that limit their use in IoT scenarios. Recent works propose to use contextual information about ambient physical properties of devices' surroundings as a shared secret to mutually authenticate devices that are co-located, e.g., the same room. In this paper, we analyze these context-based authentication solutions with regard to their security and requirements on context quality. We quantify their achievable security based on empirical real-world data from context measurements in typical IoT environments.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"18 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72880805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}