Tanguy Sassolas, C. Sandionigi, Alexandre Guerre, Alexandre Aminot, P. Vivet, H. Boussetta, L. Ferro, N. Peltier
To offer more computing power to modern SoCs, transistors keep scaling in new technology nodes. Consequently, power density is increasing, leading to higher thermal risks. Thermal issues need to be addressed as early as possible in the design flow, when the optimization opportunities are greatest. In early design stages, architects rely on virtual prototypes to model their designs' behavior with an appropriate trade-off between accuracy and simulation speed. Unfortunately, accurate virtual prototypes cannot cover the timescale of thermal effects. In this paper, we demonstrate that less accurate high-level architectural models, in conjunction with efficient power and thermal simulation tools, provide a suitable environment to analyze thermal issues and design software thermal mitigation solutions in the case of the Locomotiv MPSoC architecture.
{"title":"Early design stage thermal evaluation and mitigation: The locomotiv architectural case","authors":"Tanguy Sassolas, C. Sandionigi, Alexandre Guerre, Alexandre Aminot, P. Vivet, H. Boussetta, L. Ferro, N. Peltier","doi":"10.7873/DATE.2014.327","DOIUrl":"https://doi.org/10.7873/DATE.2014.327","PeriodicalId":6550,"journal":{"name":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"36 1","pages":"1-2"},"publicationDate":"2014-03-24","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"81336444"}
The emerging trend toward utilizing chip multi-core processors (CMPs) that support dynamic voltage and frequency scaling (DVFS) is driven by user requirements for high performance and low power. To overcome the limitations of conventional chip-wide DVFS and achieve the maximum possible energy saving, per-core DVFS is being enabled in recent CMP offerings. While per-core DVFS reduces the power consumed by the CMP, the power dissipated by the many voltage regulators (VRs) needed to support it becomes critical. This paper focuses on the dynamic control of the VRs in a CMP platform. Starting with a proposed platform with a configurable VR-to-core power distribution network, two optimization methods are presented to maximize the system-wide energy savings: (i) reactive VR consolidation, which reconfigures the network to maximize the power conversion efficiency of the VRs under pre-determined DVFS levels for the cores, and (ii) proactive VR consolidation, which determines new DVFS levels to maximize the total energy savings without any performance degradation. Results from detailed experiments demonstrate up to 35% VR energy loss reduction and 14% total energy saving.
{"title":"VRCon: Dynamic reconfiguration of voltage regulators in a multicore platform","authors":"Woojoo Lee, Yanzhi Wang, Massoud Pedram","doi":"10.7873/DATE2014.378","DOIUrl":"https://doi.org/10.7873/DATE2014.378","PeriodicalId":6550,"journal":{"name":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"104 1","pages":"1-6"},"publicationDate":"2014-03-24","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"78776769"}
The poor performance of NAND Flash memory, such as long access latency and coarse access granularity, is the major bottleneck of graph processing. This paper proposes an intelligent storage for graph processing based on fast, low-cost racetrack memory and a pointer-assisted graph representation. Our experiments show that the proposed racetrack-memory-based storage reduces the total processing time of three representative graph computations by 40.2% to 86.9% compared to GraphChi, a graph processing system that exploits sequential accesses on a conventional NAND Flash-based SSD. Faster execution also reduces energy consumption by 39.6% to 90.0%. The in-storage processing capability gives an additional 10.5% to 16.4% performance improvement and a 12.0% to 14.4% reduction in energy consumption.
{"title":"Accelerating graph computation with racetrack memory and pointer-assisted graph representation","authors":"Eunhyuk Park, S. Yoo, Sunggu Lee, Hai Helen Li","doi":"10.7873/DATE.2014.172","DOIUrl":"https://doi.org/10.7873/DATE.2014.172","PeriodicalId":6550,"journal":{"name":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"28 1","pages":"1-4"},"publicationDate":"2014-03-24","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"90042962"}
Per-core power proxies for multi-core processors are known to use several dozen hardware activity monitors to achieve 2% accuracy on core power estimation. These activity monitors are typically not accessible to the user, and even if they were, there would be a significant overhead in using them at the kernel or OS level for power monitoring or control. Furthermore, when scaled up to hundreds of cores per chip, such power proxies become a computational bottleneck for power management operations such as chip power capping. In this paper, we show that an accuracy of 4% or better for per-core power estimation can be achieved using an ultra-compact power proxy based on a hybrid set of only four user-accessible parameters, namely core frequency, core temperature, instructions per cycle, and active-state residency. Our proxy is nonlinear, valid across all P and C states, and is based on a randomized power data collection strategy that aims at exercising all the P and C levels of each core. We illustrate the accuracy of the model using the full suite of the SPEC CPU 2006 benchmarks on a 12-core processor.
{"title":"Unified, ultra compact, quadratic power proxies for multi-core processors","authors":"M. Yasin, Anas Shahrour, I. Elfadel","doi":"10.7873/DATE.2014.347","DOIUrl":"https://doi.org/10.7873/DATE.2014.347","PeriodicalId":6550,"journal":{"name":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"10 1","pages":"1-4"},"publicationDate":"2014-03-24","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"89654435"}
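The abstract above describes a nonlinear per-core power proxy built from four parameters, and the paper title calls it quadratic. The following is a minimal sketch of what fitting such a model could look like, assuming a full second-order polynomial in the four inputs fitted by ordinary least squares. The function names, the normalization of all inputs to [0, 1], and the synthetic training data are hypothetical choices made for illustration; this is not the authors' implementation or data collection strategy.

```python
import random
from itertools import combinations_with_replacement


def quadratic_features(x):
    """Expand x into [1, x_i ..., x_i * x_j for i <= j] (15 terms for 4 inputs)."""
    feats = [1.0] + list(x)
    feats += [x[i] * x[j]
              for i, j in combinations_with_replacement(range(len(x)), 2)]
    return feats


def solve_linear(a, b):
    """Solve a * w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        tail = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - tail) / m[r][r]
    return w


def fit_quadratic_proxy(samples, powers):
    """Least-squares fit via the normal equations (X^T X) w = X^T y."""
    rows = [quadratic_features(x) for x in samples]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, powers)) for i in range(k)]
    return solve_linear(xtx, xty)


def predict(w, x):
    """Evaluate the fitted quadratic model at input x."""
    return sum(wi * fi for wi, fi in zip(w, quadratic_features(x)))


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical normalized samples: [frequency, temperature, IPC, residency].
    xs = [[rng.random() for _ in range(4)] for _ in range(200)]
    # Made-up ground truth, used only to exercise the fitting code.
    ys = [1.0 + 2.0 * f * f + 0.5 * t * r + 0.3 * ipc for f, t, ipc, r in xs]
    w = fit_quadratic_proxy(xs, ys)
    print(len(w), "coefficients; sample prediction:", predict(w, xs[0]))
```

With only four inputs the model has 15 coefficients, so evaluating it per core is a handful of multiply-adds, which is consistent with the abstract's emphasis on compactness. On real measurements, scaling the inputs before fitting (as assumed here) keeps the normal equations reasonably well conditioned.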
This paper presents a reasoning-based approach to analog circuit synthesis using ordered node clustering representations (ONCR) to describe alternative circuit features and symbolic circuit comparison to characterize performance tradeoffs of synthesized solutions. Case studies illustrate application of the proposed methods to topology selection and refinement.
{"title":"Novel circuit topology synthesis method using circuit feature mining and symbolic comparison","authors":"C. Ferent, A. Doboli","doi":"10.7873/DATE.2014.030","DOIUrl":"https://doi.org/10.7873/DATE.2014.030","PeriodicalId":6550,"journal":{"name":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"1 1","pages":"1-4"},"publicationDate":"2014-03-24","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"89827441"}
Traditional dynamic simulation with standard delay format (SDF) back-annotation cannot be reliably performed on large designs. The large size of SDF files makes event-driven timing simulation extremely slow, as it has to process an excessive number of events. To accelerate gate-level timing simulation, we propose an automated, fast, prediction-based gate-level timing simulation that combines static timing analysis (STA) at the block level with dynamic timing simulation at the I/O interfaces. We demonstrate that the proposed timing simulation can be done earlier in the design cycle, in parallel with synthesis.
{"title":"Fast STA prediction-based gate-level timing simulation","authors":"T. B. Ahmad, M. Ciesielski","doi":"10.7873/DATE.2014.261","DOIUrl":"https://doi.org/10.7873/DATE.2014.261","PeriodicalId":6550,"journal":{"name":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"32 1","pages":"1-6"},"publicationDate":"2014-03-24","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"89937581"}
Shengcheng Wang, F. Firouzi, Fabian Oboril, M. Tahoori
In recent years, interconnect issues have emerged as major performance challenges for Two-Dimensional-Integrated-Circuits (2D-ICs). In this context, Three-Dimensional-ICs (3D-ICs), which consist of several active layers stacked above each other, offer a very attractive alternative to conventional 2D-ICs. However, 3D-ICs also face many challenges associated with Power Distribution Network (PDN) design due to the higher power density and larger supply current compared to 2D-ICs. As an important part of 3D-IC PDNs, Power/Ground (P/G) Through-Silicon-Vias (TSVs) should be well managed. Excessive or ill-placed P/G TSVs impact the power integrity (e.g., IR-drop) and also consume a considerable amount of chip real estate. In this work, we propose a Mixed-Integer-Linear-Programming (MILP)-based technique to plan the P/G TSVs. The goal of our approach is to minimize the average IR-drop while satisfying the total area constraint of TSVs by optimizing the P/G TSV placement. Therefore, the locations, sizes, and total number of the P/G TSVs are co-optimized. The experimental results show that the average IR-drop can be reduced by 11.8% on average using the proposed method compared to a random placement technique with a much smaller runtime.
{"title":"P/G TSV planning for IR-drop reduction in 3D-ICs","authors":"Shengcheng Wang, F. Firouzi, Fabian Oboril, M. Tahoori","doi":"10.7873/DATE.2014.057","DOIUrl":"https://doi.org/10.7873/DATE.2014.057","PeriodicalId":6550,"journal":{"name":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"76 1","pages":"1-6"},"publicationDate":"2014-03-24","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"86870675"}
A. Nikitakis, Theofilos Paganos, I. Papaefstathiou
One of the most important challenges in the field of Computer Vision is the implementation of low-power embedded systems that will execute very accurate, yet real-time, algorithms. In the visual tracking sector one of the most promising approaches is the recently introduced OpenTLD algorithm which uses a random forest classification method. While it is very robust, it cannot be efficiently parallelized in its native form as its memory access pattern has certain characteristics that make it hard to take advantage of the conventional memory hierarchies. In this paper, we present a novel embedded system implementing this algorithm. We accelerate the bottleneck of the algorithm by designing and implementing a high bandwidth distributed memory sub-system which is independent of the various software parameters. We demonstrate the applicability and efficiency of this novel approach by implementing our scheme in a modern FPGA.
{"title":"A novel embedded system for vision tracking","authors":"A. Nikitakis, Theofilos Paganos, I. Papaefstathiou","doi":"10.7873/DATE.2014.353","DOIUrl":"https://doi.org/10.7873/DATE.2014.353","PeriodicalId":6550,"journal":{"name":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"49 1","pages":"1-4"},"publicationDate":"2014-03-24","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"87483699"}
Mobile computing has been woven into everyday life to a great extent. The usage of mobile devices is clearly imprinted with the user's personal signature. The ability to learn such a signature offers immense potential for workload prediction and resource management. In this work, we investigate user behavior modeling and apply the model to energy management. Our goal is to maximize the quality of service (QoS) provided by the mobile device (i.e., smartphone) while keeping the risk of battery depletion below a given threshold. A Markov Decision Process (MDP) is constructed from historical user behavior. The optimal management policy is solved using linear programming. Simulations based on real user traces validate that, compared to existing battery energy management techniques, the stochastic control performs better in boosting the mobile device's QoS without significantly increasing the chance of battery depletion.
{"title":"Battery aware stochastic QoS boosting in mobile computing devices","authors":"Hao Shen, Qiuwen Chen, Qinru Qiu","doi":"10.5555/2616606.2616818","DOIUrl":"https://doi.org/10.5555/2616606.2616818","PeriodicalId":6550,"journal":{"name":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"12 1","pages":"1-4"},"publicationDate":"2014-03-24","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"88716973"}
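The abstract above states that the management policy is obtained by linear programming but does not give the formulation. As a hedged illustration only, the textbook occupancy-measure linear program for a discounted MDP with one cost constraint has the form below. All symbols are assumptions introduced here rather than the authors' notation: $\rho(s,a)$ is the discounted state-action occupancy, $r$ a reward (which could stand for QoS), $c$ a cost (which could indicate battery depletion), $C$ the allowed risk threshold, $P$ the transition probabilities, $\mu$ the initial state distribution, and $\gamma$ the discount factor. The paper's actual model may differ.

```latex
\begin{aligned}
\max_{\rho \ge 0}\quad & \sum_{s,a} \rho(s,a)\, r(s,a) \\
\text{s.t.}\quad & \sum_{a} \rho(s',a) - \gamma \sum_{s,a} P(s' \mid s,a)\, \rho(s,a) = \mu(s') \quad \forall s', \\
& \sum_{s,a} \rho(s,a)\, c(s,a) \le C .
\end{aligned}
```

In this generic formulation, a (possibly randomized) policy is recovered from the solution as $\pi(a \mid s) = \rho(s,a) / \sum_{a'} \rho(s,a')$ for every state with nonzero occupancy.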
K. Chandrasekar, Sven Goossens, C. Weis, Martijn Koedam, B. Akesson, N. Wehn, K. Goossens
Manufacturing-time process (P) variations and run-time voltage (V) and temperature (T) variations can severely affect a DRAM's performance. To counter these effects, DRAM vendors provide substantial design-time PVT timing margins to guarantee correct DRAM functionality under worst-case operating conditions. Unfortunately, with technology scaling these timing margins have become large and very pessimistic for a majority of the manufactured DRAMs. While run-time variations are specific to operating conditions and, as a result, their margins are difficult to optimize, process variations are manufacturing-time effects, and excessive process-margins can be reduced at run-time, on a per-device basis, if properly identified. In this paper, we propose a generic post-manufacturing performance characterization methodology for DRAMs that identifies this excess in process-margins for any given DRAM device at run-time, while retaining the requisite margins for voltage (noise) and temperature variations. By doing so, the methodology ascertains the actual impact of process variations on the particular DRAM device and optimizes its access latencies (timings), thereby improving its overall performance. We evaluate this methodology on 48 DDR3 devices (from 12 DIMMs) and verify the derived timings under worst-case operating conditions, showing up to 33.3% and 25.9% reduction in DRAM read and write latencies, respectively.
{"title":"Exploiting expendable process-margins in DRAMs for run-time performance optimization","authors":"K. Chandrasekar, Sven Goossens, C. Weis, Martijn Koedam, B. Akesson, N. Wehn, K. Goossens","doi":"10.7873/DATE.2014.186","DOIUrl":"https://doi.org/10.7873/DATE.2014.186","PeriodicalId":6550,"journal":{"name":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"6 1","pages":"1-6"},"publicationDate":"2014-03-24","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"88913324"}