PrORAM: Dynamic prefetcher for Oblivious RAM
Xiangyao Yu, Syed Kamran Haider, Ling Ren, Christopher W. Fletcher, Albert Kwon, Marten van Dijk, S. Devadas
Oblivious RAM (ORAM) is an established technique to hide the access pattern to an untrusted storage system. With ORAM, a curious adversary cannot tell what address the user is accessing when observing the bits moving between the user and the storage system. All existing ORAM schemes achieve obliviousness by adding redundancy to the storage system, i.e., each access is turned into multiple random accesses. Such redundancy incurs a large performance overhead. Although traditional data prefetching techniques successfully hide memory latency in DRAM-based systems, they do not work well for ORAM because ORAM does not have enough memory bandwidth available for issuing prefetch requests. In this paper, we exploit ORAM locality by taking advantage of ORAM's internal structures. While it might seem that obliviousness and locality are two contradictory concepts, we challenge this intuition by exploiting data locality in ORAM without sacrificing security. In particular, we propose a dynamic ORAM prefetching technique called PrORAM (Dynamic Prefetcher for ORAM) and comprehensively explore its design space. PrORAM detects data locality in programs at runtime and exploits it without leaking any information about the access pattern. Our simulation results show that PrORAM significantly improves ORAM performance: it achieves an average performance gain of 20% over the baseline ORAM for memory-intensive Splash2 benchmarks and 5.5% for SPEC06 workloads. The performance gain for YCSB and TPCC in DBMS benchmarks is 23.6% and 5%, respectively. On average, PrORAM offers twice the performance gain of a static super block scheme.
{"title":"PrORAM: Dynamic prefetcher for Oblivious RAM","authors":"Xiangyao Yu, Syed Kamran Haider, Ling Ren, Christopher W. Fletcher, Albert Kwon, Marten van Dijk, S. Devadas","doi":"10.1145/2749469.2750413","DOIUrl":"https://doi.org/10.1145/2749469.2750413","url":null,"abstract":"Oblivious RAM (ORAM) is an established technique to hide the access pattern to an untrusted storage system. With ORAM, a curious adversary cannot tell what address the user is accessing when observing the bits moving between the user and the storage system. All existing ORAM schemes achieve obliviousness by adding redundancy to the storage system, i.e., each access is turned into multiple random accesses. Such redundancy incurs a large performance overhead.Although traditional data prefetching techniques successfully hide memory latency in DRAM based systems, it turns out that they do not work well for ORAM because ORAM does not have enough memory bandwidth available for issuing prefetch requests. In this paper, we exploit ORAM locality by taking advantage of the ORAM internal structures. While it might seem apparent that obliviousness and locality are two contradictory concepts, we challenge this intuition by exploiting data locality in ORAM without sacrificing security. In particular, we propose a dynamic ORAM prefetching technique called PrORAM (Dynamic Prefetcher for ORAM) and comprehensively explore its design space. PrORAM detects data locality in programs at runtime, and exploits the locality without leaking any information on the access pattern. Our simulation results show that with PrORAM, the performance of ORA M can be significantly improved. PrORAM achieves an average performance gain of 20% over the baseline ORA M for memory intensive benchmarks among Splash2 and 5.5% for SP EC06 workloads. The peiformance gain for YCSB and TPCC in DBMS benchmarks is 23.6% and 5% respectively. On average, PrORAM offers twice the performance gain than that offered by a static super block scheme.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"42 1","pages":"616-628"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89572583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MiSAR: Minimalistic synchronization accelerator with resource overflow management
Ching-Kai Liang, Milos Prvulović
While numerous hardware synchronization mechanisms have been proposed, they either no longer function or suffer great performance loss when their hardware resources are exceeded, or they add significant complexity and cost to handle such resource overflows. Additionally, prior hardware synchronization proposals focus on a single type of synchronization (barrier or lock), so several mechanisms are likely to be needed to support real applications, many of which use locks, barriers, and/or condition variables. This paper proposes MiSAR, a minimalistic synchronization accelerator (MSA) that supports all three commonly used types of synchronization (locks, barriers, and condition variables), and a novel overflow management unit (OMU) that dynamically manages its (very) limited hardware synchronization resources. The OMU allows safe and efficient dynamic transitions between hardware (MSA) and software synchronization implementations. This allows the MSA's resources to be used only for currently active synchronization operations, providing significant performance benefits even when the number of synchronization variables used in the program is much larger than the MSA's resources. Because it allows a safe transition between hardware and software synchronization, the OMU also facilitates thread suspend/resume, migration, and other thread-management activities. Finally, the MSA/OMU combination decouples the instruction set support (how the program invokes hardware-supported synchronization) from the actual implementation of the accelerator, allowing different accelerators (or even wholesale removal of the accelerator) in the future without changes to OMU-compatible application or system code. We show that, even with only 2 MSA entries in each tile, the MSA/OMU combination on average performs within 3% of ideal (zero-latency) synchronization, and achieves a speedup of 1.43× over the software (pthreads) implementation.
{"title":"MiSAR: Minimalistic synchronization accelerator with resource overflow management","authors":"Ching-Kai Liang, Milos Prvulović","doi":"10.1145/2749469.2750396","DOIUrl":"https://doi.org/10.1145/2749469.2750396","url":null,"abstract":"While numerous hardware synchronization mechanisms have been proposed, they either no longer function or suffer great performance loss when their hardware resources are exceeded, or they add significant complexity and cost to handle such resource overflows. Additionally, prior hardware synchronization proposals focus on one type (barrier or lock) of synchronization, so several mechanisms are likely to be needed to support real applications, many of which use locks, barriers, and/or condition variables. This paper proposes MiSAR, a minimalistic synchronization accelerator (MSA) that supports all three commonly used types of synchronization (locks, barriers, and condition variables), and a novel overflow management unit (OMU) that dynamically manages its (very) limited hardware synchronization resources. The OMU allows safe and efficient dynamic transitions between using hardware (MSA) and software synchronization implementations. This allows the MSA's resources to be used only for currently-active synchronization operations, providing significant performance benefits even when the number of synchronization variables used in the program is much larger than the MSA's resources. Because it allows a safe transition between hardware and software synchronization, the OMU also facilitates thread suspend/resume, migration, and other thread-management activities. Finally, the MSA/OMU combination decouples the instruction set support (how the program invokes hardware-supported synchronization) from the actual implementation of the accelerator, allowing different accelerators (or even wholesale removal of the accelerator) in the future without changes to OMU-compatible application or system code. We show that, even with only 2 MSA entries in each tile, the MSA/OMU combination on average performs within 3% of ideal (zero-latency) synchronization, and achieves a speedup of 1.43X over the software (pthreads) implementation.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"2 1","pages":"414-426"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86551915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HEB: Deploying and managing hybrid energy buffers for improving datacenter efficiency and economy
Longjun Liu, Chao Li, Hongbin Sun, Yang Hu, Juncheng Gu, Tao Li, J. Xin, Nanning Zheng
Today, an increasing number of applications and services are hosted by large-scale data centers. Massive and irregular load surges challenge data center power infrastructures. As a result, power mismatch between supply and demand has emerged as a crucial issue in modern data centers, which are either under-provisioned or powered by intermittent power sources. Recent proposals have employed energy storage devices such as uninterruptible power supply (UPS) systems to address this issue. However, current approaches lack the capacity to handle irregular and unpredictable power mismatches efficiently. In this paper, we propose Hybrid Energy Buffering (HEB), the first heterogeneous and adaptive strategy that incorporates super-capacitors (SCs) into existing data centers to dynamically deal with power mismatches. Our techniques exploit diverse energy-absorbing characteristics and intelligent load assignment policies to provide efficiency- and scenario-aware power mismatch management. More attractively, our management schemes make costly energy storage devices more affordable and economical for datacenter-scale use. We evaluate the HEB design with a real system prototype. Compared with a homogeneous battery energy buffering system, HEB improves energy efficiency by 39.7%, extends UPS lifetime by 4.7×, reduces system downtime by 41%, and improves renewable energy utilization by 81.2%. Our TCO analysis shows that HEB manifests high ROI and gains more than 1.9× peak-shaving benefit over an 8-year period. It allows datacenters to adapt to various power supply anomalies, thereby improving operational efficiency, resiliency, and economy.
{"title":"HEB: Deploying and managing hybrid energy buffers for improving datacenter efficiency and economy","authors":"Longjun Liu, Chao Li, Hongbin Sun, Yang Hu, Juncheng Gu, Tao Li, J. Xin, Nanning Zheng","doi":"10.1145/2749469.2750384","DOIUrl":"https://doi.org/10.1145/2749469.2750384","url":null,"abstract":"Today, an increasing number of applications and services are being hosted by large-scale data centers. The massive and irregular load surges challenge data center power infrastructures. As a result, power mismatching between supply and demand has emerged as a crucial issue in modern data centers which are either under-provisioned or powered by intermittent power sources. Recent proposals have employed energy storage devices such as the uninterruptible power supply (UPS) systems to address this issue. However, current approaches lack the capacity of efficiently handling the irregular and unpredictable power mismatches. In this paper, we propose Hybrid Energy Buffering (HEB), the first heterogeneous and adaptive strategy that incorporates super-capacitors (SCs) into existing data centers to dynamically deal with power mismatches. Our techniques exploit diverse energy absorbing characteristics and intelligent load assignment policies to provide efficiency-and scenario- aware power mismatch management. More attractively, our management schemes make the costly energy storage devices more affordable and economical for datacenter-scale usage. We evaluate the HEB design with a real system prototype. Compared with a homogenous battery energy buffering system, HEB could improve energy efficiency by 39.7%, extend UPS lifetime by 4.7×, reduce system downtime by 41% and improve renewable energy utilization by 81.2%. Our TCO analysis shows that HEB manifests high ROI and is able to gain more than 1.9× peak shaving benefit during an 8-years period. It allows datacenters to adapt to various power supply anomalies, thereby improving operational efficiency, resiliency and economy.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"29 1","pages":"463-475"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80503887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hi-fi playback: Tolerating position errors in shift operations of racetrack memory
Chao Zhang, Guangyu Sun, Xian Zhang, Weiqi Zhang, Weisheng Zhao, Tao Wang, Yun Liang, Yongpan Liu, Yu Wang, J. Shu
Racetrack memory is an emerging non-volatile memory based on spintronic domain wall technology. It can achieve ultra-high storage density, and its read/write speed is comparable to that of SRAM. Due to the tape-like structure of its storage cell, a "shift" operation is introduced to access racetrack memory. Prior research has therefore focused mainly on minimizing the shift latency/energy of racetrack memory while leveraging its ultra-high storage density. The reliability of the shift operation, however, is not well addressed. In fact, racetrack memory suffers from unsuccessful shifts due to domain misalignment, a problem we call "position error" in this work. Position errors can reduce the mean-time-to-failure (MTTF) of racetrack memory to an intolerable level. Even worse, conventional error correction codes (ECCs), which are designed for "bit errors", cannot protect racetrack memory from them. In this work, we investigate the position error model of a shift operation and categorize position errors into two types: "stop-in-middle" errors and "out-of-step" errors. To eliminate stop-in-middle errors, we propose a technique called sub-threshold shift (STS) that performs a more reliable shift in two stages. To detect and recover from out-of-step errors, we propose a protection mechanism called position error correction code (p-ECC). We first describe how to design a p-ECC for different protection strengths and analyze the corresponding design overhead. We then show how to reduce the area cost of p-ECC by leveraging the "overhead region" in a racetrack memory stripe. With these protection mechanisms, we introduce a position-error-aware shift architecture. Experimental results demonstrate that, after applying our techniques, the overall MTTF of racetrack memory improves from 1.33 μs to more than 69 years, with only 0.2% performance degradation. Trade-offs among reliability, area, performance, and energy are also explored with comprehensive discussion.
{"title":"Hi-fi playback: Tolerating position errors in shift operations of racetrack memory","authors":"Chao Zhang, Guangyu Sun, Xian Zhang, Weiqi Zhang, Weisheng Zhao, Tao Wang, Yun Liang, Yongpan Liu, Yu Wang, J. Shu","doi":"10.1145/2749469.2750388","DOIUrl":"https://doi.org/10.1145/2749469.2750388","url":null,"abstract":"Racetrack memory is an emerging non-volatile memory based on spintronic domain wall technology. It can achieve ultra-high storage density. Also, its read/write speed is comparable to that of SRAM. Due to the tape-like structure of its storage cell, a “shift” operation is introduced to access racetrack memory. Thus, prior research mainly focused on minimizing shift latency/energy of racetrack memory while leveraging its ultra-high storage density. Yet the reliability issue of a shift operation, however, is not well addressed. In fact, racetrack memory suffers from unsuccessful shift due to domain misalignment. Such a problem is called “position error” in this work. It can significantly reduce mean-time-to-failure (MTTF) of racetrack memory to an intolerable level. Even worse, conventional error correction codes (ECCs), which are designed for “bit errors”, cannot protect racetrack memory from the position errors. In this work, we investigate the position error model of a shift operation and categorize position errors into two types: “stop-in-middle” error and “out-of-step” error. To eliminate the stop-in-middle error, we propose a technique called sub-threshold shift (STS) to perform a more reliable shift in two stages. To detect and recover the out-of-step error, a protection mechanism called position error correction code (p-ECC) is proposed. We first describe how to design a p-ECC for different protection strength and analyze corresponding design overhead. Then, we further propose how to reduce area cost of p-ECC by leveraging the “overhead region” in a racetrack memory stripe. With these protection mechanisms, we introduce a position-error-aware shift architecture. Experimental results demonstrate that, after using our techniques, the overall MTTF of racetrack memory is improved from 1.33μs to more than 69 years, with only 0.2% performance degradation. Trade-off among reliability, area, performance, and energy is also explored with comprehensive discussion.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"7 1","pages":"694-706"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79047680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DjiNN and Tonic: DNN as a service and its implications for future warehouse scale computers
Johann Hauswald, Yiping Kang, M. Laurenzano, Quan Chen, Cheng Li, T. Mudge, R. Dreslinski, Jason Mars, Lingjia Tang
As applications such as Apple Siri, Google Now, Microsoft Cortana, and Amazon Echo continue to gain traction, web-service companies are adopting large deep neural networks (DNNs) for machine learning challenges such as image processing, speech recognition, and natural language processing. A number of open questions arise as to the design of a server platform specialized for DNNs and how modern warehouse scale computers (WSCs) should be outfitted to provide DNN as a service for these applications. In this paper, we present DjiNN, an open infrastructure for DNN as a service in WSCs, and Tonic Suite, a suite of 7 end-to-end applications that span image, speech, and language processing. We use DjiNN to design a high-throughput DNN system based on massive GPU server designs and provide insights into the varying characteristics across applications. After studying the throughput, bandwidth, and power properties of DjiNN and Tonic Suite, we investigate several design points for future WSC architectures, including the total cost of ownership implications of a WSC with a disaggregated GPU pool versus a WSC composed of homogeneous integrated GPU servers. We improve DNN throughput by over 120× for all but one application (40× for Facial Recognition) on an NVIDIA K40 GPU. On a GPU server composed of 8 NVIDIA K40s, we achieve near-linear scaling (around 1000× throughput improvement) for 3 of the 7 applications. Through our analysis, we also find that GPU-enabled WSCs improve total cost of ownership over CPU-only designs by 4-20×, depending on the composition of the workload.
{"title":"DjiNN and Tonic: DNN as a service and its implications for future warehouse scale computers","authors":"Johann Hauswald, Yiping Kang, M. Laurenzano, Quan Chen, Cheng Li, T. Mudge, R. Dreslinski, Jason Mars, Lingjia Tang","doi":"10.1145/2749469.2749472","DOIUrl":"https://doi.org/10.1145/2749469.2749472","url":null,"abstract":"As applications such as Apple Siri, Google Now, Microsoft Cortana, and Amazon Echo continue to gain traction, webservice companies are adopting large deep neural networks (DNN) for machine learning challenges such as image processing, speech recognition, natural language processing, among others. A number of open questions arise as to the design of a server platform specialized for DNN and how modern warehouse scale computers (WSCs) should be outfitted to provide DNN as a service for these applications. In this paper, we present DjiNN, an open infrastructure for DNN as a service in WSCs, and Tonic Suite, a suite of 7 end-to-end applications that span image, speech, and language processing. We use DjiNN to design a high throughput DNN system based on massive GPU server designs and provide insights as to the varying characteristics across applications. After studying the throughput, bandwidth, and power properties of DjiNN and Tonic Suite, we investigate several design points for future WSC architectures. We investigate the total cost of ownership implications of having a WSC with a disaggregated GPU pool versus a WSC composed of homogeneous integrated GPU servers. We improve DNN throughput by over 120× for all but one application (40× for Facial Recognition) on an NVIDIA K40 GPU. On a GPU server composed of 8 NVIDIA K40s, we achieve near-linear scaling (around 1000× throughput improvement) for 3 of the 7 applications. Through our analysis, we also find that GPU-enabled WSCs improve total cost of ownership over CPU-only designs by 4-20×, depending on the composition of the workload.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"74 1","pages":"27-40"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77111629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reducing world switches in virtualized environment with flexible cross-world calls
Wenhao Li, Yubin Xia, Haibo Chen, B. Zang, Haibing Guan
Modern computers are built with increasingly complex software stacks crossing multiple layers (i.e., worlds), where cross-world calls have become a necessity for various important purposes like security, reliability, and reduced complexity. Unfortunately, cross-world call support is currently limited (e.g., syscall, vmcall), so other calls must be emulated by detouring multiple times through the privileged software layer (i.e., the OS kernel and hypervisor). This causes not only significant performance degradation but also unnecessary implementation complexity. This paper argues that it is time to rethink the design of traditional cross-world call mechanisms by reviewing existing systems built upon hypervisors. Following the design philosophy of separating authentication from authorization, this paper advocates decoupling the authorization of whether a world call is permitted (done by software) from the unforgeable identification of calling peers (done by hardware). The result is a flexible cross-world call scheme (namely CrossOver) that allows secure, efficient, and flexible cross-world calls across multiple layers, not only within the same address space but also across multiple address spaces. We demonstrate that CrossOver can be approximated using an existing hardware mechanism (namely VMFUNC) and that a trivial modification of the VMFUNC mechanism can provide full support for CrossOver. To show its usefulness, we have conducted case studies using several recent systems such as Proxos, Hyper-Shell, Tahoma, and ShadowContext. Performance measurements using full-system emulation and a real processor with VMFUNC show that CrossOver significantly boosts the performance of the mentioned systems.
{"title":"Reducing world switches in virtualized environment with flexible cross-world calls","authors":"Wenhao Li, Yubin Xia, Haibo Chen, B. Zang, Haibing Guan","doi":"10.1145/2749469.2750406","DOIUrl":"https://doi.org/10.1145/2749469.2750406","url":null,"abstract":"Modern computers are built with increasingly complex software stack crossing multiple layers (i.e., worlds), where cross-world call has been a necessity for various important purposes like security, reliability, and reduced complexity. Unfortunately, there is currently limited cross-world call support (e.g., syscall, vmcall), and thus other calls need to be emulated by detouring multiple times to the privileged software layer (i.e., OS kernel and hypervisor). This causes not only significant performance degradation, but also unnecessary implementation complexity. This paper argues that it is time to rethink the design of traditional cross-world call mechanisms by reviewing existing systems built upon hypervisors. Following the design philosophy of separating authentication from authorization, this paper advocates decoupling of the authorization on whether a world call is permitted (by software) from unforgeable identification of calling peers (by hardware). This results in a flexible cross-world call scheme (namely CrossOver) that allows secure, efficient and flexible cross-world calls across multiple layers not only within the same address space, but also across multiple address spaces. We demonstrate that CrossOver can be approximated by using existing hardware mechanism (namely VMFUNC) and a trivial modification of the VMFUNC mechanism can provide a full support of CrossOver. To show its usefulness, we have conducted case studies by using several recent systems such as Proxos, Hyper-Shell, Tahoma and ShadowContext. Performance measurements using full-system emulation and a real processor with VMFUNC shows that CrossOver significantly boosts the performance of the mentioned systems.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"68 1","pages":"375-387"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81257710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ShiDianNao: Shifting vision processing closer to the sensor
Zidong Du, Robert Fasthuber, Tianshi Chen, P. Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, O. Temam
In recent years, neural network accelerators have been shown to achieve both high energy efficiency and high performance for a broad application scope within the important category of recognition and mining applications. Still, both the energy efficiency and performance of such accelerators remain limited by memory accesses. In this paper, we focus on image applications, arguably the most important category among recognition and mining applications. The state-of-the-art neural networks for these applications are Convolutional Neural Networks (CNNs), and they have an important property: weights are shared among many neurons, considerably reducing the neural network's memory footprint. This property makes it possible to map a CNN entirely within an SRAM, eliminating all DRAM accesses for weights. By further hoisting this accelerator next to the image sensor, it is possible to eliminate all remaining DRAM accesses, i.e., for inputs and outputs. In this paper, we propose such a CNN accelerator, placed next to a CMOS or CCD sensor. The absence of DRAM accesses, combined with careful exploitation of the specific data access patterns within CNNs, allows us to design an accelerator that is 60× more energy efficient than the previous state-of-the-art neural network accelerator. We present a full design down to the layout at 65 nm, with a modest footprint of 4.86 mm² and a power consumption of only 320 mW, yet still about 30× faster than high-end GPUs.
{"title":"ShiDianNao: Shifting vision processing closer to the sensor","authors":"Zidong Du, Robert Fasthuber, Tianshi Chen, P. Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, O. Temam","doi":"10.1145/2749469.2750389","DOIUrl":"https://doi.org/10.1145/2749469.2750389","url":null,"abstract":"In recent years, neural network accelerators have been shown to achieve both high energy efficiency and high performance for a broad application scope within the important category of recognition and mining applications. Still, both the energy efficiency and peiformance of such accelerators remain limited by memory accesses. In this paper, we focus on image applications, arguably the most important category among recognition and mining applications. The neural networks which are state-of-the-art for these applications are Convolutional Neural Networks (CNN), and they have an important property: weights are shared among many neurons, considerably reducing the neural network memory footprint. This property allows to entirely map a CNN within an SRAM, eliminating all DRAM accesses for weights. By further hoisting this accelerator next to the image sensor, it is possible to eliminate all remaining DRAM accesses, i.e., for inputs and outputs. In this paper, we propose such a CNN accelerator, placed next to a CMOS or CCD sensor. The absence of DRAM accesses combined with a careful exploitation of the specific data access patterns within CNNs allows us to design an accelerator which is 60x more energy efficient than the previous state-of-the-art neural network accelerator. We present a fult design down to the layout at 65 nm, with a modest footprint of 4.86 mm2 and consuming only 320 mW, but still about 30x faster than high-end GPUs.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"70 1","pages":"92-104"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74395640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Load Slice Core microarchitecture
Trevor E. Carlson, W. Heirman, O. Allam, S. Kaxiras, L. Eeckhout
Driven by the motivation to expose instruction-level parallelism (ILP), microprocessor cores have evolved from simple, in-order pipelines into complex, superscalar out-of-order designs. By extracting ILP, these processors also enable parallel cache and memory operations as a useful side-effect. Today, however, the growing off-chip memory wall and complex cache hierarchies of many-core processors make cache and memory accesses ever more costly. This increases the importance of extracting memory hierarchy parallelism (MHP), while reducing the net impact of more general, yet complex and power-hungry ILP-extraction techniques. In addition, for multi-core processors operating in power- and energy-constrained environments, energy-efficiency has largely replaced single-thread performance as the primary concern. Based on this observation, we propose a core microarchitecture that is aimed squarely at generating parallel accesses to the memory hierarchy while maximizing energy efficiency. The Load Slice Core extends the efficient in-order, stall-on-use core with a second in-order pipeline that enables memory accesses and address-generating instructions to bypass stalled instructions in the main pipeline. Backward program slices containing address-generating instructions leading up to loads and stores are extracted automatically by the hardware, using a novel iterative algorithm that requires no software support or recompilation. On average, the Load Slice Core improves performance over a baseline in-order processor by 53% with overheads of only 15% in area and 22% in power, leading to an increase in energy efficiency (MIPS/Watt) over in-order and out-of-order designs by 43% and over 4.7×, respectively. In addition, for a power- and area-constrained many-core design, the Load Slice Core outperforms both in-order and out-of-order designs, achieving 53% and 95% higher performance, respectively, thus providing an alternative direction for future many-core processors.
{"title":"The Load Slice Core microarchitecture","authors":"Trevor E. Carlson, W. Heirman, O. Allam, S. Kaxiras, L. Eeckhout","doi":"10.1145/2749469.2750407","DOIUrl":"https://doi.org/10.1145/2749469.2750407","url":null,"abstract":"Driven by the motivation to expose instruction-level parallelism (ILP), microprocessor cores have evolved from simple, in-order pipelines into complex, superscalar out-of-order designs. By extracting ILP, these processors also enable parallel cache and memory operations as a useful side-effect. Today, however, the growing off-chip memory wall and complex cache hierarchies of many-core processors make cache and memory accesses ever more costly. This increases the importance of extracting memory hierarchy parallelism (MHP), while reducing the net impact of more general, yet complex and power-hungry ILP-extraction techniques. In addition, for multi-core processors operating in power- and energy-constrained environments, energy-efficiency has largely replaced single-thread performance as the primary concern. Based on this observation, we propose a core microarchitecture that is aimed squarely at generating parallel accesses to the memory hierarchy while maximizing energy efficiency. The Load Slice Core extends the efficient in-order, stall-on-use core with a second in-order pipeline that enables memory accesses and address-generating instructions to bypass stalled instructions in the main pipeline. Backward program slices containing address-generating instructions leading up to loads and stores are extracted automatically by the hardware, using a novel iterative algorithm that requires no software support or recompilation. On average, the Load Slice Core improves performance over a baseline in-order processor by 53% with overheads of only 15% in area and 22% in power, leading to an increase in energy efficiency (MIPS/Watt) over in-order and out-of-order designs by 43% and over 4.7×, respectively. In addition, for a power- and area-constrained many-core design, the Load Slice Core outperforms both in-order and out-of-order designs, achieving a 53% and 95% higher performance, respectively, thus providing an alternative direction for future many-core processors.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"35 1","pages":"272-284"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85625924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CloudMonatt: An architecture for security health monitoring and attestation of virtual machines in cloud computing
Tianwei Zhang, R. Lee
Cloud customers need guarantees regarding the security of their virtual machines (VMs) operating within an Infrastructure as a Service (IaaS) cloud system. This is complicated by the customer not knowing where their VM is executing, and by the semantic gap between what the customer wants to know and what can be measured in the cloud. We present an architecture for monitoring a VM's security health, with the ability to attest this to the customer in an unforgeable manner. We show a concrete implementation of property-based attestation and a full prototype based on the OpenStack open source cloud software.
{"title":"CloudMonatt: An architecture for security health monitoring and attestation of virtual machines in cloud computing","authors":"Tianwei Zhang, R. Lee","doi":"10.1145/2749469.2750422","DOIUrl":"https://doi.org/10.1145/2749469.2750422","url":null,"abstract":"Cloud customers need guarantees regarding the security of their virtual machines (VMs), operating within an Infrastructure as a Service (IaaS) cloud system. This is complicated by the customer not knowing where his VM is executing, and on the semantic gap between what the customer wants to know versus what can be measured in the cloud. We present an architecture for monitoring a VM's security health, with the ability to attest this to the customer in an unforgeable manner. We show a concrete implementation of property-based attestation and a full prototype based on the OpenStack open source cloud software.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"2015 1","pages":"362-374"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91316298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semantic locality and context-based prefetching using reinforcement learning
L. Peled, Shie Mannor, U. Weiser, Yoav Etsion
Most modern memory prefetchers rely on spatio-temporal locality to predict the memory addresses likely to be accessed by a program in the near future. Emerging workloads, however, make increasing use of irregular data structures, and thus exhibit a lower degree of spatial locality. This makes them less amenable to spatio-temporal prefetchers. In this paper, we introduce the concept of Semantic Locality, which uses inherent program semantics to characterize access relations. We show how, in principle, semantic locality can capture the relationship between data elements in a manner agnostic to the actual data layout, and we argue that semantic locality transcends spatio-temporal concerns. We further introduce the context-based memory prefetcher, which approximates semantic locality using reinforcement learning. The prefetcher identifies access patterns by applying reinforcement learning methods over machine and code attributes, that provide hints on memory access semantics. We test our prefetcher on a variety of benchmarks that employ both regular and irregular patterns. For the SPEC 2006 suite, it delivers speedups as high as 2.8× (20% on average) over a baseline with no prefetching, and outperforms leading spatio-temporal prefetchers. Finally, we show that the context-based prefetcher makes it possible for naive, pointer-based implementations of irregular algorithms to achieve performance comparable to that of spatially optimized code.
{"title":"Semantic locality and context-based prefetching using reinforcement learning","authors":"L. Peled, Shie Mannor, U. Weiser, Yoav Etsion","doi":"10.1145/2749469.2749473","DOIUrl":"https://doi.org/10.1145/2749469.2749473","url":null,"abstract":"Most modern memory prefetchers rely on spatio-temporal locality to predict the memory addresses likely to be accessed by a program in the near future. Emerging workloads, however, make increasing use of irregular data structures, and thus exhibit a lower degree of spatial locality. This makes them less amenable to spatio-temporal prefetchers. In this paper, we introduce the concept of Semantic Locality, which uses inherent program semantics to characterize access relations. We show how, in principle, semantic locality can capture the relationship between data elements in a manner agnostic to the actual data layout, and we argue that semantic locality transcends spatio-temporal concerns. We further introduce the context-based memory prefetcher, which approximates semantic locality using reinforcement learning. The prefetcher identifies access patterns by applying reinforcement learning methods over machine and code attributes, that provide hints on memory access semantics. We test our prefetcher on a variety of benchmarks that employ both regular and irregular patterns. For the SPEC 2006 suite, it delivers speedups as high as 2.8× (20% on average) over a baseline with no prefetching, and outperforms leading spatio-temporal prefetchers. Finally, we show that the context-based prefetcher makes it possible for naive, pointer-based implementations of irregular algorithms to achieve performance comparable to that of spatially optimized code.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"27 1","pages":"285-297"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86552348","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}