While all computation generates electromagnetic (EM) side-channel signals, some of the strongest and farthest-propagating signals are created when an existing strong periodic signal (e.g., a clock signal) becomes stronger or weaker (amplitude-modulated) depending on processor or memory activity. However, modern systems create emanations at thousands of different frequencies, so finding the few emanations that are amplitude-modulated by processor/memory activity is a difficult, error-prone, and time-consuming task. This paper presents a methodology for rapidly finding such activity-modulated signals. The method uses specially designed micro-benchmarks to create recognizable spectral patterns, then processes the recorded spectra to identify signals that exhibit amplitude-modulation behavior. We apply this method to several computer systems and find a number of such modulated signals. To illustrate how our methodology can benefit side-channel security research and practice, we also identify the physical mechanisms behind those signals, and find that the strongest signals are created by voltage regulators, memory refreshes, and DRAM clocks. Our results indicate that each signal may carry unique information about system activity, potentially enhancing an attacker's capability to extract sensitive information. We also confirm that our methodology correctly separates emanated signals that are affected by specific processor or memory activities from those that are not.
{"title":"FASE: Finding Amplitude-modulated Side-channel Emanations","authors":"R. Callan, A. Zajić, Milos Prvulović","doi":"10.1145/2749469.2750394","DOIUrl":"https://doi.org/10.1145/2749469.2750394","url":null,"abstract":"While all computation generates electromagnetic (EM) side-channel signals, some of the strongest and farthest-propagating signals are created when an existing strong periodic signal (e.g. a clock signal) becomes stronger or weaker (amplitude-modulated) depending on processor or memory activity. However, modern systems create emanations at thousands of different frequencies, so it is a difficult, error-prone, and time-consuming task to find those few emanations that are AM-modulated by processor/memory activity. This paper presents a methodology for rapidly finding such activity-modulated signals. This method creates recognizable spectral patterns generated by specially designed micro-benchmarks and then processes the recorded spectra to identify signals that exhibit amplitude-modulation behavior. We apply this method to several computer systems and find several such modulated signals. To illustrate how our methodology can benefit side-channel security research and practice, we also identify the physical mechanisms behind those signals, and find that the strongest signals are created by voltage regulators, memory refreshes, and DRAM clocks. Our results indicate that each signal may carry unique information about system activity, potentially enhancing an attacker's capability to extract sensitive information. We also confirm that our methodology correctly separates emanated signals that are affected by specific processor or memory activities from those that are not.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"85 1","pages":"592-603"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90662446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sheng Li, Hyeontaek Lim, V. Lee, Jung Ho Ahn, Anuj Kalia, M. Kaminsky, D. Andersen, O. Seongil, Sukhan Lee, P. Dubey
Distributed in-memory key-value stores (KVSs), such as memcached, have become a critical data-serving layer in modern Internet-oriented datacenter infrastructure. Their performance and efficiency directly affect the QoS of web services and the efficiency of datacenters. Traditionally, these systems have had significant overheads from inefficient network processing, OS kernel involvement, and concurrency control. Two recent research thrusts have focused on improving key-value performance. Hardware-centric research has started to explore specialized platforms, including FPGAs, for KVSs; results demonstrated an order-of-magnitude increase in throughput and energy efficiency over stock memcached. Software-centric research revisited the KVS application to address fundamental software bottlenecks and to exploit the full potential of modern commodity hardware; these efforts too showed orders-of-magnitude improvement over stock memcached. We aim to architect high-performance and efficient KVS platforms, and we start with a rigorous architectural characterization across system stacks over a collection of representative KVS implementations. Our detailed full-system characterization not only identifies the critical hardware/software ingredients for high-performance KVS systems, but also leads to guided optimizations atop a recent design to achieve a record-setting throughput of 120 million requests per second (MRPS) on a single commodity server. Our implementation delivers 9.2X the performance (RPS) and 2.8X the system energy efficiency (RPS/watt) of the best published FPGA-based claims. We craft a set of design principles for future platform architectures, and via detailed simulations demonstrate the capability of achieving a billion RPS with a single server constructed following our principles.
{"title":"Architecting to achieve a billion requests per second throughput on a single key-value store server platform","authors":"Sheng Li, Hyeontaek Lim, V. Lee, Jung Ho Ahn, Anuj Kalia, M. Kaminsky, D. Andersen, O. Seongil, Sukhan Lee, P. Dubey","doi":"10.1145/2749469.2750416","DOIUrl":"https://doi.org/10.1145/2749469.2750416","url":null,"abstract":"Distributed in-memory key-value stores (KVSs), such as memcached, have become a critical data serving layer in modern Internet-oriented datacenter infrastructure. Their performance and efficiency directly affect the QoS of web services and the efficiency of datacenters. Traditionally, these systems have had significant overheads from inefficient network processing, OS kernel involvement, and concurrency control. Two recent research thrusts have focused upon improving key-value performance. Hardware-centric research has started to explore specialized platforms including FPGAs for KVSs; results demonstrated an order of magnitude increase in throughput and energy efficiency over stock memcached. Software-centric research revisited the KVS application to address fundamental software bottlenecks and to exploit the full potential of modern commodity hardware; these efforts too showed orders of magnitude improvement over stock memcached. We aim at architecting high performance and efficient KVS platforms, and start with a rigorous architectural characterization across system stacks over a collection of representative KVS implementations. Our detailed full-system characterization not only identifies the critical hardware/software ingredients for high-performance KVS systems, but also leads to guided optimizations atop a recent design to achieve a record-setting throughput of 120 million requests per second (MRPS) on a single commodity server. Our implementation delivers 9.2X the performance (RPS) and 2.8X the system energy efficiency (RPS/watt) of the best-published FPGA-based claims. We craft a set of design principles for future platform architectures, and via detailed simulations demonstrate the capability of achieving a billion RPS with a single server constructed following our principles.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"72 1","pages":"476-488"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85646432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
While control speculation is highly effective for generating good schedules in out-of-order processors, it is less effective for in-order processors because compilers have trouble scheduling in the presence of unbiased branches, even when those branches are highly predictable. In this paper, we demonstrate a novel architectural branch decomposition that separates the prediction and deconvergence point of a branch from its resolution, which enables the compiler to profitably schedule across predictable but unbiased branches. We show that the hardware support for this branch architecture is a trivial extension of existing systems and describe a simple code transformation for exploiting this architectural support. As architectural changes are required, this technique is most compelling for a dynamic binary translation-based system like Project Denver. We evaluate the performance improvements enabled by this transformation for several in-order configurations across the SPEC 2006 benchmark suites. We show that our technique produces a geomean speedup of 11% for SPEC 2006 Integer, with speedups as large as 35%. Because floating-point benchmarks contain fewer unbiased but predictable branches, our geomean speedup on SPEC 2006 FP is 7%, with a maximum speedup of 26%.
{"title":"Branch vanguard: Decomposing branch functionality into prediction and resolution instructions","authors":"Daniel S. McFarlin, C. Zilles","doi":"10.1145/2749469.2750400","DOIUrl":"https://doi.org/10.1145/2749469.2750400","url":null,"abstract":"While control speculation is highly effective for generating good schedules in out-of-order processors, it is less effective for in-order processors because compilers have trouble scheduling in the presence of unbiased branches, even when those branches are highly predictable. In this paper, we demonstrate a novel architectural branch decomposition that separates the prediction and deconvergence point of a branch from its resolution, which enables the compiler to profitably schedule across predictable, but unbiased branches. We show that the hardware support for this branch architecture is a trivial extension of existing systems and describe a simple code transformation for exploiting this architectural support. As architectural changes are required, this technique is most compelling for a dynamic binary translation-based system like Project Denver. We evaluate the performance improvements enabled by this transformation for several in-order configurations across the SPEC 2006 benchmark suites. We show that our technique produces a Geomean speedup of 11% for SPEC 2006 Integer, with speedups as large as 35%. As floating point benchmarks contain fewer unbiased, but predictable branches, our Geomean speedup on SPEC 2006 FP is 7%, with a maximum speedup of 26%.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"55 1","pages":"323-335"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80806349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Approximate computing can be employed for an emerging class of applications from various domains such as multimedia, machine learning, and computer vision. Even when the approximated output of such applications is not 100% numerically correct, it is often still useful, or the difference is unnoticeable to the end user. This opens up a new design dimension to trade off application performance and energy consumption with output correctness. However, a largely unaddressed challenge is quality control: how to ensure the user experience meets a prescribed level of quality. Current approaches either do not monitor output quality or use sampling approaches to check a small subset of the output, assuming that it is representative. While these approaches have been shown to produce acceptable average errors, they often miss large errors and have no means of taking corrective action. To overcome this challenge, we propose Rumba for online detection and correction of large approximation errors in an approximate accelerator-based computing environment. Rumba employs continuous lightweight checks in the accelerator to detect large approximation errors and then fixes these errors by exact re-computation on the host processor. Rumba employs computationally inexpensive output error prediction models for efficient detection. Computing patterns amenable to approximation (e.g., map and stencil) are usually data-parallel in nature, and Rumba exploits this property for selective correction. Overall, Rumba achieves a 2.1x reduction in output error for an unchecked approximation accelerator while maintaining the accelerator's performance gains, at the cost of reducing the energy savings from 3.2x to 2.2x for a set of applications from different approximate computing domains.
{"title":"Rumba: An online quality management system for approximate computing","authors":"D. Khudia, Babak Zamirai, M. Samadi, S. Mahlke","doi":"10.1145/2749469.2750371","DOIUrl":"https://doi.org/10.1145/2749469.2750371","url":null,"abstract":"Approximate computing can be employed for an emerging class of applications from various domains such as multimedia, machine learning and computer vision. The approximated output of such applications, even though not 100% numerically correct, is often either useful or the difference is unnoticeable to the end user. This opens up a new design dimension to trade off application performance and energy consumption with output correctness. However, a largely unaddressed challenge is quality control: how to ensure the user experience meets a prescribed level of quality. Current approaches either do not monitor output quality or use sampling approaches to check a small subset of the output assuming that it is representative. While these approaches have been shown to produce average errors that are acceptable, they often miss large errors without any means to take corrective actions. To overcome this challenge, we propose Rumba for online detection and correction of large approximation errors in an approximate accelerator-based computing environment. Rumba employs continuous lightweight checks in the accelerator to detect large approximation errors and then fixes these errors by exact re-computation on the host processor. Rumba employs computationally inexpensive output error prediction models for efficient detection. Computing patterns amenable for approximation (e.g., map and stencil) are usually data parallel in nature and Rumba exploits this property for selective correction. Overall, Rumba is able to achieve 2.1x reduction in output error for an unchecked approximation accelerator while maintaining the accelerator performance gains at the cost of reducing the energy savings from 3.2x to 2.2x for a set of applications from different approximate computing domains.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"1 1","pages":"554-566"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88694086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Feng Liu, Heejin Ahn, S. Beard, Taewook Oh, David I. August
Spatial architectures are more efficient than traditional Out-of-Order (OOO) processors for computationally intensive programs. However, spatial architectures require mapping a program, either statically or dynamically, onto the spatial fabric. Static methods can generate efficient mappings, but they cannot adapt to changing workloads and are not compatible across hardware generations. Current dynamic methods are adaptive and compatible, but do not optimize as well due to their limited use of speculation and small mapping scopes. To overcome the limitations of existing dynamic mapping methods for spatial architectures, while minimizing the inefficiencies inherent in OOO superscalar processors, this paper presents DynaSpAM (Dynamic Spatial Architecture Mapping), a framework that tightly couples a spatial fabric with an OOO pipeline. DynaSpAM coaxes the OOO processor into producing an optimized mapping with a simple modification to the processor's scheduler. The insight behind DynaSpAM is that today's powerful OOO processors do for themselves most of the work necessary to produce a highly optimized mapping for a spatial architecture, including aggressively speculating control and memory dependences, and scheduling instructions using a large window. Evaluation of DynaSpAM shows a geomean speedup of 1.42× for 11 benchmarks from the Rodinia benchmark suite with a geomean 23.9% reduction in energy consumption compared to an 8-issue OOO pipeline.
{"title":"DynaSpAM: Dynamic spatial architecture mapping using Out of Order instruction schedules","authors":"Feng Liu, Heejin Ahn, S. Beard, Taewook Oh, David I. August","doi":"10.1145/2749469.2750414","DOIUrl":"https://doi.org/10.1145/2749469.2750414","url":null,"abstract":"Spatial architectures are more efficient than traditional Out-of-Order (OOO) processors for computationally intensive programs. However, spatial architectures require mapping a program, either statically or dynamically, onto the spatial fabric. Static methods can generate efficient mappings, but they cannot adapt to changing workloads and are not compatible across hardware generations. Current dynamic methods are adaptive and compatible, but do not optimize as well due to their limited use of speculation and small mapping scopes. To overcome the limitations of existing dynamic mapping methods for spatial architectures, while minimizing the inefficiencies inherent in OOO superscalar processors, this paper presents DynaSpAM (Dynamic Spatial Architecture Mapping), a framework that tightly couples a spatial fabric with an OOO pipeline. DynaSpAM coaxes the OOO processor into producing an optimized mapping with a simple modification to the processor's scheduler. The insight behind DynaSpAM is that today's powerful OOO processors do for themselves most of the work necessary to produce a highly optimized mapping for a spatial architecture, including aggressively speculating control and memory dependences, and scheduling instructions using a large window. Evaluation of DynaSpAM shows a geomean speedup of 1.42× for 11 benchmarks from the Rodinia benchmark suite with a geomean 23.9% reduction in energy consumption compared to an 8-issue OOO pipeline.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"13 1","pages":"541-553"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88857851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
General-purpose processors (GPPs), from small in-order designs to many-issue out-of-order cores, incur large power overheads which must be addressed for future technology generations. Major sources of overhead include structures which dynamically extract the data-dependence graph or maintain precise state. For irregular workloads, current specialization approaches either heavily curtail performance or simply provide too little benefit. Interestingly, well-known explicit-dataflow architectures eliminate these overheads by directly executing the data-dependence graph and eschewing instruction-precise recoverability. However, even after decades of research, dataflow architectures have yet to come into prominence as a solution. We attribute this to a lack of effective control speculation and the latency overhead of explicit communication, which is crippling for certain codes. This paper makes the observation that if both out-of-order and explicit-dataflow execution were available in one processor, many types of GPP cores could benefit from dynamically switching between them during certain phases of an application's lifetime. Analysis reveals that an ideal explicit-dataflow engine could be profitable for more than half of instructions, providing significant performance and energy improvements. The challenge is to achieve these benefits without introducing excess hardware complexity. To this end, we propose the Specialization Engine for Explicit-Dataflow (SEED). Integrated with an in-order core, we see 1.67× performance and 1.65× energy benefits; with a dual-issue out-of-order (OOO) core, 1.33× and 1.70×; and with a quad-issue OOO core, 1.14× and 1.54×.
{"title":"Exploring the potential of heterogeneous Von Neumann/dataflow execution models","authors":"Tony Nowatzki, Vinay Gangadhar, K. Sankaralingam","doi":"10.1145/2749469.2750380","DOIUrl":"https://doi.org/10.1145/2749469.2750380","url":null,"abstract":"General purpose processors (GPPs), from small inorder designs to many-issue out-of-order, incur large power overheads which must be addressed for future technology generations. Major sources of overhead include structures which dynamically extract the data-dependence graph or maintain precise state. Considering irregular workloads, current specialization approaches either heavily curtail performance, or provide simply too little benefit. Interestingly, well known explicit-dataflow architectures eliminate these overheads by directly executing the data-dependence graph and eschewing instruction-precise recoverability. However, even after decades of research, dataflow architectures have yet to come into prominence as a solution. We attribute this to a lack of effective control speculation and the latency overhead of explicit communication, which is crippling for certain codes. This paper makes the observation that if both out-of-order and explicit-dataflow were available in one processor, many types of GPP cores can benefit from dynamically switching during certain phases of an application's lifetime. Analysis reveals that an ideal explicit-dataflow engine could be profitable for more than half of instructions, providing significant performance and energy improvements. The challenge is to achieve these benefits without introducing excess hardware complexity. To this end, we propose the Specialization Engine for Explicit-Dataflow (SEED). Integrated with an inorder core, we see 1.67× performance and 1.65× energy benefits, with an Out-Of-Order (OOO) dual-issue core we see 1.33× and 1.70×, and with a quad-issue OOO, 1.14× and 1.54×.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"54 1","pages":"298-310"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91150328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wire energy has become the major contributor to energy in large lower-level caches. While wire energy is related to wire latency, its costs are exposed differently in the memory hierarchy. We propose Sub-Level Insertion Policy (SLIP), a cache management policy which reduces cache energy consumption by increasing the number of accesses served from energy-efficient locations while simultaneously decreasing intra-level data movement. In SLIP, each cache level is partitioned into several cache sublevels of differing sizes. Then, the recent reuse-distance distribution of a line is used to choose an energy-optimized insertion and movement policy for the line. The policy choice is made by a hardware unit that predicts the number of accesses and inter-level movements. Using a full-system simulation including OS interactions and hardware overheads, we show that SLIP saves 35% energy at the L2 and 22% energy at the L3 level and performs 0.75% better than a regular cache hierarchy in a single-core system. When configured to include a bypassing policy, SLIP reduces traffic to DRAM by 2.2%. This is achieved at the cost of storing 12b of metadata per cache line (2.3% overhead), a 6b policy in the PTE, and 32b of distribution metadata for each page in DRAM (an overhead of 0.1%). Using SLIP in a multiprogrammed system saves 47% LLC energy and reduces traffic to DRAM by 5.5%.
{"title":"SLIP: Reducing wire energy in the memory hierarchy","authors":"Subhasis Das, Tor M. Aamodt, W. Dally","doi":"10.1145/2749469.2750398","DOIUrl":"https://doi.org/10.1145/2749469.2750398","url":null,"abstract":"Wire energy has become the major contributor to energy in large lower level caches. While wire energy is related to wire latency its costs are exposed differently in the memory hierarchy. We propose Sub-Level Insertion Policy (SLIP), a cache management policy which improves cache energy consumption by increasing the number of accesses from energy efficient locations while simultaneously decreasing intra-level data movement. In SLIP, each cache level is partitioned into several cache sublevels of differing sizes. Then, the recent reuse distance distribution of a line is used to choose an energy-optimized insertion and movement policy for the line. The policy choice is made by a hardware unit that predicts the number of accesses and inter-level movements. Using a full-system simulation including OS interactions and hardware overheads, we show that SLIP saves 35% energy at the L2 and 22% energy at the L3 level and performs 0.75% better than a regular cache hierarchy in a single core system. When configured to include a bypassing policy, SLIP reduces traffic to DRAM by 2.2%. This is achieved at the cost of storing 12b metadata per cache line (2.3% overhead), a 6b policy in the PTE, and 32b distribution metadata for each page in the DRAM (a overhead of 0.1%). Using SLIP in a multiprogrammed system saves 47% LLC energy, and reduces traffic to DRAM by 5.5%.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"24 1","pages":"349-361"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87404472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
B. Lopes, R. Auler, Luiz E. Ramos, E. Borin, R. Azevedo
Microprocessor manufacturers typically keep old instruction sets in modern processors to ensure backward compatibility with legacy software. The introduction of newer extensions to the ISA increases the design complexity of microprocessor front-ends, exacerbates the consumption of precious on-chip resources (e.g., silicon area and energy), and demands more effort for hardware verification and debugging. We analyzed several x86 applications and operating systems deployed between 1995 and 2012 and observed that many instructions stop being used over time and that more than 500 instructions were never used in these applications. We also investigate the impact of including these unused instructions in the design of the x86 decoders and propose SHRINK, a mechanism to remove old instructions without breaking backward compatibility with legacy code. SHRINK allows us to remove 40% of the instructions from the x86 ISA and improve the critical path, area, and power consumption of the instruction decoder by 23%, 48%, and 49%, respectively, on average.
{"title":"SHRINK: Reducing the ISA complexity via instruction recycling","authors":"B. Lopes, R. Auler, Luiz E. Ramos, E. Borin, R. Azevedo","doi":"10.1145/2749469.2750391","DOIUrl":"https://doi.org/10.1145/2749469.2750391","url":null,"abstract":"Microprocessor manufacturers typically keep old instruction sets in modern processors to ensure backward compatibility with legacy software. The introduction of newer extensions to the ISA increases the design complexity of microprocessor front-ends, exacerbates the consumption of precious on-chip resources (e.g., silicon area and energy), and demands more efforts for hardware verification and debugging. We analyzed several x86 applications and operating systems deployed between 1995 and 2012 and observed that many instructions stop being used over time, and more than 500 instructions were never used in these applications. We also investigate the impact of including these unused instructions in the design of the x86 decoders and propose SHRINK, a mechanism to remove old instructions without breaking backward compatibility with legacy code. SHRINK allows us to remove 40% of the instructions from the x86 ISA and improve the critical path, area, and power consumption of the instruction decoder, respectively, by 23%, 48%, and 49%, on average.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"33 1","pages":"311-322"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82084297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rakesh Komuravelli, Matthew D. Sinclair, Johnathan Alsop, Muhammad Huzaifa, Maria Kotsifakou, Prakalp Srivastava, S. Adve, Vikram S. Adve
Heterogeneous systems employ specialization for energy efficiency. Since data movement is expected to be a dominant consumer of energy, these systems employ specialized memories (e.g., scratchpads and FIFOs) to handle targeted data more efficiently. These memory structures, however, tend to exist in local address spaces, incurring significant performance and energy penalties due to inefficient data movement between the global and private spaces. We propose an efficient heterogeneous memory system where specialized memory components are tightly coupled in a unified and coherent address space. This paper applies these ideas to a system with CPUs and GPUs with scratchpads and caches. We introduce a new memory organization, stash, that combines the benefits of caches and scratchpads without incurring their downsides. Like a scratchpad, the stash is directly addressed (without tags and TLB accesses) and provides compact storage. Like a cache, the stash is globally addressable and visible, providing implicit data movement and increased data reuse. We show that the stash provides better performance and lower energy than both a scratchpad and a cache, while enabling new use cases for heterogeneous systems. For 4 microbenchmarks, which exploit new use cases (e.g., reuse across GPU compute kernels), the stash reduces execution cycles by an average of 27% and 13% relative to scratchpads and caches, respectively, and energy by an average of 53% and 35%. For 7 current GPU applications, which are not designed to exploit the new features of the stash, the stash reduces cycles by 10% and 12% on average (max 22% and 31%) relative to scratchpads and caches, respectively, and energy by 16% and 32% on average (max 30% and 51%).
{"title":"Stash: Have your scratchpad and cache it too","authors":"Rakesh Komuravelli, Matthew D. Sinclair, Johnathan Alsop, Muhammad Huzaifa, Maria Kotsifakou, Prakalp Srivastava, S. Adve, Vikram S. Adve","doi":"10.1145/2749469.2750374","DOIUrl":"https://doi.org/10.1145/2749469.2750374","url":null,"abstract":"Heterogeneous systems employ specialization for energy efficiency. Since data movement is expected to be a dominant consumer of energy, these systems employ specialized memories (e.g., scratchpads and FIFOs) for better efficiency for targeted data. These memory structures, however, tend to exist in local address spaces, incurring significant performance and energy penalties due to inefficient data movement between the global and private spaces. We propose an efficient heterogeneous memory system where specialized memory components are tightly coupled in a unified and coherent address space. This paper applies these ideas to a system with CPUs and GPUs with scratchpads and caches. We introduce a new memory organization, stash, that combines the benefits of caches and scratchpads without incurring their downsides. Like a scratchpad, the stash is directly addressed (without tags and TLB accesses) and provides compact storage. Like a cache, the stash is globally addressable and visible, providing implicit data movement and increased data reuse. We show that the stash provides better performance and energy than a cache and a scratchpad, while enabling new use cases for heterogeneous systems. For 4 microbenchmarks, which exploit new use cases (e.g., reuse across GPU compute kernels), compared to scratchpads and caches, the stash reduces execution cycles by an average of 27% and 13% respectively and energy by an average of 53% and 35%. For 7 current GPU applications, which are not designed to exploit the new features of the stash, compared to scratchpads and caches, the stash reduces cycles by 10% and 12% on average (max 22% and 31%) respectively, and energy by 16% and 32% on average (max 30% and 51%).","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"45 1","pages":"707-719"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78525777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alexandros Daglis, Stanko Novakovic, Edouard Bugnion, B. Falsafi, Boris Grot
Datacenter operators rely on low-cost, high-density technologies to maximize throughput for data-intensive services with tight tail latencies. In-memory rack-scale computing is emerging as a promising paradigm in scale-out datacenters; it capitalizes on commodity SoCs, low-latency and high-bandwidth communication fabrics, and a remote memory access model to aggregate a rack's memory for critical data-intensive applications such as graph processing or key-value stores. Low latency and high bandwidth dictate not only eliminating communication bottlenecks in the software protocols and off-chip fabrics but also carefully integrating network interfaces on chip. The latter is a key challenge, especially in architectures with RDMA-inspired one-sided operations that aim to achieve low latency and high bandwidth through on-chip Network Interface (NI) support. This paper proposes and evaluates network interface architectures for tiled manycore SoCs for in-memory rack-scale computing. Our results indicate that a careful splitting of NI functionality per chip tile and at the chip's edge along a NOC dimension enables a rack-scale architecture to optimize for both latency and bandwidth. Our best manycore NI architecture achieves latencies within 3% of an idealized hardware NUMA and efficiently uses the full bisection bandwidth of the NOC, without changing the on-chip coherence protocol or the core's microarchitecture.
{"title":"Manycore Network Interfaces for in-memory rack-scale computing","authors":"Alexandros Daglis, Stanko Novakovic, Edouard Bugnion, B. Falsafi, Boris Grot","doi":"10.1145/2749469.2750415","DOIUrl":"https://doi.org/10.1145/2749469.2750415","url":null,"abstract":"Datacenter operators rely on low-cost, high-density technologies to maximize throughput for data-intensive services with tight tail latencies. In-memory rack-scale computing is emerging as a promising paradigm in scale-out datacenters capitalizing on commodity SoCs, low-latency and high-bandwidth communication fabrics and a remote memory access model to enable aggregation of a rack's memory for critical data-intensive applications such as graph processing or key-value stores. Low latency and high bandwidth not only dictate eliminating communication bottlenecks in the software protocols and off-chip fabrics but also a careful on-chip integration of network interfaces. The latter is a key challenge especially in architectures with RDMA-inspired one-sided operations that aim to achieve low latency and high bandwidth through on-chip Network Interface (NI) support. This paper proposes and evaluates network interface architectures for tiled manycore SoCs for in-memory rack-scale computing. Our results indicate that a careful splitting of NI functionality per chip tile and at the chip's edge along a NOC dimension enables a rack-scale architecture to optimize for both latency and bandwidth. Our best manycore NI architecture achieves latencies within 3% of an idealized hardware NUMA and efficiently uses the full bisection bandwidth of the NOC, without changing the on-chip coherence protocol or the core's microarchitecture.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"11 2 1","pages":"567-579"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73105461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}