Vasileios Karakostas, Jayneel Gandhi, Furkan Ayar, A. Cristal, M. Hill, K. McKinley, M. Nemirovsky, M. Swift, O. Unsal
Page-based virtual memory improves programmer productivity, security, and memory utilization, but incurs performance overheads due to costly page table walks after TLB misses. This overhead can reach 50% for modern workloads that access increasingly vast memory with stagnating TLB sizes. To reduce the overhead of virtual memory, this paper proposes Redundant Memory Mappings (RMM), which leverage ranges of pages and provides an efficient, alternative representation of many virtual-to-physical mappings. We define a range be a subset of process's pages that are virtually and physically contiguous. RMM translates each range with a single range table entry, enabling a modest number of entries to translate most of the process's address space. RMM operates in parallel with standard paging and uses a software range table and hardware range TLB with arbitrarily large reach. We modify the operating system to automatically detect ranges and to increase their likelihood with eager page allocation. RMM is thus transparent to applications. We prototype RMM software in Linux and emulate the hardware. RMM performs substantially better than paging alone and huge pages, and improves a wider variety of workloads than direct segments (one range per program), reducing the overhead of virtual memory to less than 1% on average.
{"title":"Redundant Memory Mappings for fast access to large memories","authors":"Vasileios Karakostas, Jayneel Gandhi, Furkan Ayar, A. Cristal, M. Hill, K. McKinley, M. Nemirovsky, M. Swift, O. Unsal","doi":"10.1145/2749469.2749471","DOIUrl":"https://doi.org/10.1145/2749469.2749471","url":null,"abstract":"Page-based virtual memory improves programmer productivity, security, and memory utilization, but incurs performance overheads due to costly page table walks after TLB misses. This overhead can reach 50% for modern workloads that access increasingly vast memory with stagnating TLB sizes. To reduce the overhead of virtual memory, this paper proposes Redundant Memory Mappings (RMM), which leverage ranges of pages and provides an efficient, alternative representation of many virtual-to-physical mappings. We define a range be a subset of process's pages that are virtually and physically contiguous. RMM translates each range with a single range table entry, enabling a modest number of entries to translate most of the process's address space. RMM operates in parallel with standard paging and uses a software range table and hardware range TLB with arbitrarily large reach. We modify the operating system to automatically detect ranges and to increase their likelihood with eager page allocation. RMM is thus transparent to applications. We prototype RMM software in Linux and emulate the hardware. RMM performs substantially better than paging alone and huge pages, and improves a wider variety of workloads than direct segments (one range per program), reducing the overhead of virtual memory to less than 1% on average.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"15 1","pages":"66-78"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72988232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jungwhan Choi, Wongyu Shin, Jaemin Jang, Jinwoong Suh, Yongkee Kwon, Youngsuk Moon, L. Kim
Several previous works have changed DRAM bank structure to reduce memory access latency and have shown performance improvement. However, changes in the area-optimized DRAM bank can incur large area-overhead. To solve this problem, we propose Multiple Clone Row DRAM (MCR-DRAM), which uses existing DRAM bank structure without any modification.
{"title":"Multiple Clone Row DRAM: A low latency and area optimized DRAM","authors":"Jungwhan Choi, Wongyu Shin, Jaemin Jang, Jinwoong Suh, Yongkee Kwon, Youngsuk Moon, L. Kim","doi":"10.1145/2749469.2750402","DOIUrl":"https://doi.org/10.1145/2749469.2750402","url":null,"abstract":"Several previous works have changed DRAM bank structure to reduce memory access latency and have shown performance improvement. However, changes in the area-optimized DRAM bank can incur large area-overhead. To solve this problem, we propose Multiple Clone Row DRAM (MCR-DRAM), which uses existing DRAM bank structure without any modification.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"2 1","pages":"223-234"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73024715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Indrani Paul, Wei Huang, Manish Arora, S. Yalamanchili
In this paper, we address the problem of efficiently managing the relative power demands of a high-performance GPU and its memory subsystem. We develop a management approach that dynamically tunes the hardware operating configurations to maintain balance between the power dissipated in compute versus memory access across GPGPU application phases. Our goal is to reduce power with minimal performance degradation. Accordingly, we construct predictors that assess the online sensitivity of applications to three hardware tunables-compute frequency, number of active compute units, and memory bandwidth. Using these sensitivity predictors, we propose a two-level coordinated power management scheme, Harmonia, which coordinates the hardware power states of the GPU and the memory system. Through hardware measurements on a commodity GPU, we evaluate Harmonia against a state-of-the-practice commodity GPU power management scheme, as well as an oracle scheme. Results show that Harmonia improves measured energy-delay squared (ED2) by up to 36% (12% on average) with negligible performance loss across representative GPGPU workloads, and on an average is within 3% of the oracle scheme.
{"title":"Harmonia: Balancing compute and memory power in high-performance GPUs","authors":"Indrani Paul, Wei Huang, Manish Arora, S. Yalamanchili","doi":"10.1145/2749469.2750404","DOIUrl":"https://doi.org/10.1145/2749469.2750404","url":null,"abstract":"In this paper, we address the problem of efficiently managing the relative power demands of a high-performance GPU and its memory subsystem. We develop a management approach that dynamically tunes the hardware operating configurations to maintain balance between the power dissipated in compute versus memory access across GPGPU application phases. Our goal is to reduce power with minimal performance degradation. Accordingly, we construct predictors that assess the online sensitivity of applications to three hardware tunables-compute frequency, number of active compute units, and memory bandwidth. Using these sensitivity predictors, we propose a two-level coordinated power management scheme, Harmonia, which coordinates the hardware power states of the GPU and the memory system. Through hardware measurements on a commodity GPU, we evaluate Harmonia against a state-of-the-practice commodity GPU power management scheme, as well as an oracle scheme. Results show that Harmonia improves measured energy-delay squared (ED2) by up to 36% (12% on average) with negligible performance loss across representative GPGPU workloads, and on an average is within 3% of the oracle scheme.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"10 1","pages":"54-65"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80886713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Stephenson, S. Hari, Yunsup Lee, Eiman Ebrahimi, Daniel R. Johnson, D. Nellans, Mike O'Connor, S. Keckler
To aid application characterization and architecture design space exploration, researchers and engineers have developed a wide range of tools for CPUs, including simulators, profilers, and binary instrumentation tools. With the advent of GPU computing, GPU manufacturers have developed similar tools leveraging hardware profiling and debugging hooks. To date, these tools are largely limited by the fixed menu of options provided by the tool developer and do not offer the user the flexibility to observe or act on events not in the menu. This paper presents SASSI (NVIDIA assembly code “SASS” Instrumentor), a low-level assembly-language instrumentation tool for GPUs. Like CPU binary instrumentation tools, SASSI allows a user to specify instructions at which to inject user-provided instrumentation code. These facilities allow strategic placement of counters and code into GPU assembly code to collect user-directed, fine-grained statistics at hardware speeds. SASSI instrumentation is inherently parallel, leveraging the concurrency of the underlying hardware. In addition to the details of SASSI, this paper provides four case studies that show how SASSI can be used to characterize applications and explore the architecture design space along the dimensions of instruction control flow, memory systems, value similarity, and resilience.
{"title":"Flexible software profiling of GPU architectures","authors":"M. Stephenson, S. Hari, Yunsup Lee, Eiman Ebrahimi, Daniel R. Johnson, D. Nellans, Mike O'Connor, S. Keckler","doi":"10.1145/2749469.2750375","DOIUrl":"https://doi.org/10.1145/2749469.2750375","url":null,"abstract":"To aid application characterization and architecture design space exploration, researchers and engineers have developed a wide range of tools for CPUs, including simulators, profilers, and binary instrumentation tools. With the advent of GPU computing, GPU manufacturers have developed similar tools leveraging hardware profiling and debugging hooks. To date, these tools are largely limited by the fixed menu of options provided by the tool developer and do not offer the user the flexibility to observe or act on events not in the menu. This paper presents SASSI (NVIDIA assembly code “SASS” Instrumentor), a low-level assembly-language instrumentation tool for GPUs. Like CPU binary instrumentation tools, SASSI allows a user to specify instructions at which to inject user-provided instrumentation code. These facilities allow strategic placement of counters and code into GPU assembly code to collect user-directed, fine-grained statistics at hardware speeds. SASSI instrumentation is inherently parallel, leveraging the concurrency of the underlying hardware. In addition to the details of SASSI, this paper provides four case studies that show how SASSI can be used to characterize applications and explore the architecture design space along the dimensions of instruction control flow, memory systems, value similarity, and resilience.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"8 1","pages":"185-197"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86409186","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jin Wang, Norman Rubin, A. Sidelnik, S. Yalamanchili
GPUs have been proven effective for structured applications that map well to the rigid 1D-3D grid of threads in modern bulk synchronous parallel (BSP) programming languages. However, less success has been encountered in mapping data intensive irregular applications such as graph analytics, relational databases, and machine learning. Recently introduced nested device-side kernel launching functionality in the GPU is a step in the right direction, but still falls short of being able to effectively harness the GPUs performance potential. We propose a new mechanism called Dynamic Thread Block Launch (DTBL) to extend the current bulk synchronous parallel model underlying the current GPU execution model by supporting dynamic spawning of lightweight thread blocks. This mechanism supports the nested launching of thread blocks rather than kernels to execute dynamically occurring parallel work elements. This paper describes the execution model of DTBL, device-runtime support, and microarchitecture extensions to track and execute dynamically spawned thread blocks. Experiments with a set of irregular data intensive CUDA applications executing on a cycle-level simulator show that DTBL achieves average 1.21x speedup over the original flat implementation and average 1.40x over the implementation with device-side kernel launches using CUDA Dynamic Parallelism.
{"title":"Dynamic Thread Block Launch: A lightweight execution mechanism to support irregular applications on GPUs","authors":"Jin Wang, Norman Rubin, A. Sidelnik, S. Yalamanchili","doi":"10.1145/2749469.2750393","DOIUrl":"https://doi.org/10.1145/2749469.2750393","url":null,"abstract":"GPUs have been proven effective for structured applications that map well to the rigid 1D-3D grid of threads in modern bulk synchronous parallel (BSP) programming languages. However, less success has been encountered in mapping data intensive irregular applications such as graph analytics, relational databases, and machine learning. Recently introduced nested device-side kernel launching functionality in the GPU is a step in the right direction, but still falls short of being able to effectively harness the GPUs performance potential. We propose a new mechanism called Dynamic Thread Block Launch (DTBL) to extend the current bulk synchronous parallel model underlying the current GPU execution model by supporting dynamic spawning of lightweight thread blocks. This mechanism supports the nested launching of thread blocks rather than kernels to execute dynamically occurring parallel work elements. This paper describes the execution model of DTBL, device-runtime support, and microarchitecture extensions to track and execute dynamically spawned thread blocks. Experiments with a set of irregular data intensive CUDA applications executing on a cycle-level simulator show that DTBL achieves average 1.21x speedup over the original flat implementation and average 1.40x over the implementation with device-side kernel launches using CUDA Dynamic Parallelism.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"15 1","pages":"528-540"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86213715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Timothy G. Rogers, Daniel R. Johnson, Mike O'Connor, S. Keckler
This paper studies the effect of warp sizing and scheduling on performance and efficiency in GPUs. We propose Variable Warp Sizing (VWS) which improves the performance of divergent applications by using a small base warp size in the presence of control flow and memory divergence. When appropriate, our proposed technique groups sets of these smaller warps together by ganging their execution in the warp scheduler, improving performance and energy efficiency for regular applications. Warp ganging is necessary to prevent performance degradation on regular workloads due to memory convergence slip, which results from the inability of smaller warps to exploit the same intra-warp memory locality as larger warps. This paper explores the effect of warp sizing on control flow divergence, memory divergence, and locality. For an estimated 5% area cost, our ganged scheduling microarchitecture results in a simulated 35% performance improvement on divergent workloads by allowing smaller groups of threads to proceed independently, and eliminates the performance degradation due to memory convergence slip that is observed when convergent applications are executed with smaller warp sizes.
{"title":"A variable warp size architecture","authors":"Timothy G. Rogers, Daniel R. Johnson, Mike O'Connor, S. Keckler","doi":"10.1145/2749469.2750410","DOIUrl":"https://doi.org/10.1145/2749469.2750410","url":null,"abstract":"This paper studies the effect of warp sizing and scheduling on performance and efficiency in GPUs. We propose Variable Warp Sizing (VWS) which improves the performance of divergent applications by using a small base warp size in the presence of control flow and memory divergence. When appropriate, our proposed technique groups sets of these smaller warps together by ganging their execution in the warp scheduler, improving performance and energy efficiency for regular applications. Warp ganging is necessary to prevent performance degradation on regular workloads due to memory convergence slip, which results from the inability of smaller warps to exploit the same intra-warp memory locality as larger warps. This paper explores the effect of warp sizing on control flow divergence, memory divergence, and locality. For an estimated 5% area cost, our ganged scheduling microarchitecture results in a simulated 35% performance improvement on divergent workloads by allowing smaller groups of threads to proceed independently, and eliminates the performance degradation due to memory convergence slip that is observed when convergent applications are executed with smaller warp sizes.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"6 1","pages":"489-501"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83968802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Görkem Asilioglu, Zhaoxiang Jin, Murat Köksal, Omkar Javeri, Soner Önder
LaZy Superscalar is a processor architecture which delays the execution of fetched instructions until their results are needed by other instructions. This approach eliminates dead instructions and provides the necessary means to fuse dependent instructions across multiple control dependencies by explicitly tracking control and data dependencies through a matrix based scheduler. We present this novel redesign of scheduling, recovery and commit mechanisms and evaluate the performance of the proposed architecture. Our simulations using Spec 2006 benchmark suite indicate that LaZy Superscalar can achieve significant speed-ups while providing respectable power savings compared to a conventional superscalar processor.
{"title":"LaZy Superscalar","authors":"Görkem Asilioglu, Zhaoxiang Jin, Murat Köksal, Omkar Javeri, Soner Önder","doi":"10.1145/2749469.2750409","DOIUrl":"https://doi.org/10.1145/2749469.2750409","url":null,"abstract":"LaZy Superscalar is a processor architecture which delays the execution of fetched instructions until their results are needed by other instructions. This approach eliminates dead instructions and provides the necessary means to fuse dependent instructions across multiple control dependencies by explicitly tracking control and data dependencies through a matrix based scheduler. We present this novel redesign of scheduling, recovery and commit mechanisms and evaluate the performance of the proposed architecture. Our simulations using Spec 2006 benchmark suite indicate that LaZy Superscalar can achieve significant speed-ups while providing respectable power savings compared to a conventional superscalar processor.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"50 1","pages":"260-271"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89431988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
David Lo, Liqun Cheng, R. Govindaraju, Parthasarathy Ranganathan, C. Kozyrakis
User-facing, latency-sensitive services, such as websearch, underutilize their computing resources during daily periods of low traffic. Reusing those resources for other tasks is rarely done in production services since the contention for shared resources can cause latency spikes that violate the service-level objectives of latency-sensitive tasks. The resulting under-utilization hurts both the affordability and energy-efficiency of large-scale datacenters. With technology scaling slowing down, it becomes important to address this opportunity. We present Heracles, a feedback-based controller that enables the safe colocation of best-effort tasks alongside a latency-critical service. Heracles dynamically manages multiple hardware and software isolation mechanisms, such as CPU, memory, and network isolation, to ensure that the latency-sensitive job meets latency targets while maximizing the resources given to best-effort tasks. We evaluate Heracles using production latency-critical and batch workloads from Google and demonstrate average server utilizations of 90% without latency violations across all the load and colocation scenarios that we evaluated.
{"title":"Heracles: Improving resource efficiency at scale","authors":"David Lo, Liqun Cheng, R. Govindaraju, Parthasarathy Ranganathan, C. Kozyrakis","doi":"10.1145/2749469.2749475","DOIUrl":"https://doi.org/10.1145/2749469.2749475","url":null,"abstract":"User-facing, latency-sensitive services, such as websearch, underutilize their computing resources during daily periods of low traffic. Reusing those resources for other tasks is rarely done in production services since the contention for shared resources can cause latency spikes that violate the service-level objectives of latency-sensitive tasks. The resulting under-utilization hurts both the affordability and energy-efficiency of large-scale datacenters. With technology scaling slowing down, it becomes important to address this opportunity. We present Heracles, a feedback-based controller that enables the safe colocation of best-effort tasks alongside a latency-critical service. Heracles dynamically manages multiple hardware and software isolation mechanisms, such as CPU, memory, and network isolation, to ensure that the latency-sensitive job meets latency targets while maximizing the resources given to best-effort tasks. We evaluate Heracles using production latency-critical and batch workloads from Google and demonstrate average server utilizations of 90% without latency violations across all the load and colocation scenarios that we evaluated.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"27 1","pages":"450-462"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83788340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
P. Pannuto, Yoonmyung Lee, Ye-Sheng Kuo, Z. Foo, B. Kempke, Gyouho Kim, R. Dreslinski, D. Blaauw, P. Dutta
As we show in this paper, I/O has become the limiting factor in scaling down size and power toward the goal of invisible computing. Achieving this goal will require composing optimized and specialized-yet reusable-components with an interconnect that permits tiny, ultra-low power systems. In contrast to today's interconnects which are limited by power-hungry pull-ups or high-overhead chip-select lines, our approach provides a superset of common bus features but at lower power, with fixed area and pin count, using fully synthesizable logic, and with surprisingly low protocol overhead. We present MBus, a new 4-pin, 22.6 pJ/bit/chip chip-to-chip interconnect made of two "shoot-through" rings. MBus facilitates ultra-low power system operation by implementing automatic power-gating of each chip in the system, easing the integration of active, inactive, and activating circuits on a single die. In addition, we introduce a new bus primitive: power oblivious communication, which guarantees message reception regardless of the recipient's power state when a message is sent. This disentangles power management from communication, greatly simplifying the creation of viable, modular, and heterogeneous systems that operate on the order of nanowatts. To evaluate the viability, power, performance, overhead, and scalability of our design, we build both hardware and software implementations of MBus and show its seamless operation across two FPGAs and twelve custom chips from three different semiconductor processes. A three-chip, 2.2 mm3 MBus system draws 8 nW of total system standby power and uses only 22.6 pJ/bit/chip for communication. This is the lowest power for any system bus with MBus 's feature set.
{"title":"MBus: An ultra-low power interconnect bus for next generation nanopower systems","authors":"P. Pannuto, Yoonmyung Lee, Ye-Sheng Kuo, Z. Foo, B. Kempke, Gyouho Kim, R. Dreslinski, D. Blaauw, P. Dutta","doi":"10.1145/2749469.2750376","DOIUrl":"https://doi.org/10.1145/2749469.2750376","url":null,"abstract":"As we show in this paper, I/O has become the limiting factor in scaling down size and power toward the goal of invisible computing. Achieving this goal will require composing optimized and specialized-yet reusable-components with an interconnect that permits tiny, ultra-low power systems. In contrast to today's interconnects which are limited by power-hungry pull-ups or high-overhead chip-select lines, our approach provides a superset of common bus features but at lower power, with fixed area and pin count, using fully synthesizable logic, and with surprisingly low protocol overhead. We present MBus, a new 4-pin, 22.6 pJ/bit/chip chip-to-chip interconnect made of two \"shoot-through\" rings. MBus facilitates ultra-low power system operation by implementing automatic power-gating of each chip in the system, easing the integration of active, inactive, and activating circuits on a single die. In addition, we introduce a new bus primitive: power oblivious communication, which guarantees message reception regardless of the recipient's power state when a message is sent. This disentangles power management from communication, greatly simplifying the creation of viable, modular, and heterogeneous systems that operate on the order of nanowatts. To evaluate the viability, power, performance, overhead, and scalability of our design, we build both hardware and software implementations of MBus and show its seamless operation across two FPGAs and twelve custom chips from three different semiconductor processes. A three-chip, 2.2 mm3 MBus system draws 8 nW of total system standby power and uses only 22.6 pJ/bit/chip for communication. This is the lowest power for any system bus with MBus 's feature set.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"16 1","pages":"629-641"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82292796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Amir Rahmati, Matthew Hicks, Daniel E. Holcomb, Kevin Fu
Approximate computing research seeks to trade-off the accuracy of computation for increases in performance or reductions in power consumption. The observation driving approximate computing is that many applications tolerate small amounts of error which allows for an opportunistic relaxation of guard bands (e.g., clock rate and voltage). Besides affecting performance and power, reducing guard bands exposes analog properties of traditionally digital components. For DRAM, one analog property exposed by approximation is the variability of memory cell decay times. In this paper, we show how the differing cell decay times of approximate DRAM creates an error pattern that serves as a system identifying fingerprint. To validate this observation, we build an approximate memory platform and perform experiments that show that the fingerprint due to approximation is device dependent and resilient to changes in environment and level of approximation. To identify a DRAM chip given an approximate output, we develop a distance metric that yields a two-orders-of-magnitude difference in the distance between approximate results produced by the same DRAM chip and those produced by other DRAM chips. We use these results to create a mathematical model of approximate DRAM that we leverage to explore the end-to-end deanonymizing effects of approximate memory using a commodity system running an image manipulation program. The results from our experiment show that given less than 100 approximate outputs, the fingerprint for an approximate DRAM begins to converge to a single, machine identifying fingerprint.
{"title":"Probable cause: The deanonymizing effects of approximate DRAM","authors":"Amir Rahmati, Matthew Hicks, Daniel E. Holcomb, Kevin Fu","doi":"10.1145/2749469.2750419","DOIUrl":"https://doi.org/10.1145/2749469.2750419","url":null,"abstract":"Approximate computing research seeks to trade-off the accuracy of computation for increases in performance or reductions in power consumption. The observation driving approximate computing is that many applications tolerate small amounts of error which allows for an opportunistic relaxation of guard bands (e.g., clock rate and voltage). Besides affecting performance and power, reducing guard bands exposes analog properties of traditionally digital components. For DRAM, one analog property exposed by approximation is the variability of memory cell decay times. In this paper, we show how the differing cell decay times of approximate DRAM creates an error pattern that serves as a system identifying fingerprint. To validate this observation, we build an approximate memory platform and perform experiments that show that the fingerprint due to approximation is device dependent and resilient to changes in environment and level of approximation. To identify a DRAM chip given an approximate output, we develop a distance metric that yields a two-orders-of-magnitude difference in the distance between approximate results produced by the same DRAM chip and those produced by other DRAM chips. We use these results to create a mathematical model of approximate DRAM that we leverage to explore the end-to-end deanonymizing effects of approximate memory using a commodity system running an image manipulation program. The results from our experiment show that given less than 100 approximate outputs, the fingerprint for an approximate DRAM begins to converge to a single, machine identifying fingerprint.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"4 1","pages":"604-615"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80184843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}