
2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA): Latest Publications

Future Vector Microprocessor Extensions for Data Aggregations
Timothy Hayes, Oscar Palomar, O. Unsal, A. Cristal, M. Valero
As the rate of annual data generation grows exponentially, there is a demand to aggregate and summarise vast amounts of information quickly. In the past, frequency scaling was relied upon to push application throughput. Today, Dennard scaling has ceased and further performance must come from exploiting parallelism. Single instruction-multiple data (SIMD) instruction sets offer a highly efficient and scalable way of exploiting data-level parallelism (DLP). While microprocessors originally offered very simple SIMD support targeted at multimedia applications, these extensions have been growing both in width and functionality. Observing this trend, we use a simulation framework to model future SIMD support and then propose and evaluate five different ways of vectorising data aggregation. We find that although data aggregation is abundant in DLP, it is often too irregular to be expressed efficiently using typical SIMD instructions. Based on this observation, we propose a set of novel algorithms and SIMD instructions to better capture this irregular DLP. Furthermore, we discover that the best algorithm is highly dependent on the characteristics of the input. Our proposed solution can dynamically choose the optimal algorithm in the majority of cases and achieves speedups between 2.7x and 7.6x over a scalar baseline.
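The abstract's central finding is that the best vectorisation strategy depends on the input's key distribution. As a rough illustration of that adaptive selection, here is a NumPy sketch (not the paper's proposed SIMD instructions; the sampling heuristic and its threshold are assumptions of this sketch):

```python
import numpy as np

def aggregate_scalar(keys, vals, n_groups):
    """Scalar baseline: one element per iteration."""
    acc = np.zeros(n_groups)
    for k, v in zip(keys, vals):
        acc[k] += v
    return acc

def aggregate_sorted(keys, vals, n_groups):
    """Sort-based vectorisation: pays off when groups are few and large."""
    order = np.argsort(keys, kind="stable")
    k, v = keys[order], vals[order]
    starts = np.r_[0, np.flatnonzero(np.diff(k)) + 1]   # group boundaries
    acc = np.zeros(n_groups)
    acc[k[starts]] = np.add.reduceat(v, starts)         # segmented SIMD-style sum
    return acc

def aggregate_scatter(keys, vals, n_groups):
    """Scatter-add vectorisation: pays off when key conflicts are rare."""
    acc = np.zeros(n_groups)
    np.add.at(acc, keys, vals)                          # conflict-safe scatter-add
    return acc

def aggregate_adaptive(keys, vals, n_groups, sample=1024):
    """Mimic the dynamic chooser: probe a sample of keys, then dispatch."""
    distinct = len(np.unique(keys[:sample]))
    if distinct < sample // 8:        # few hot groups -> sort + segmented reduce
        return aggregate_sorted(keys, vals, n_groups)
    return aggregate_scatter(keys, vals, n_groups)      # many groups -> scatter

keys = np.random.randint(0, 16, size=100_000)
vals = np.random.rand(100_000)
assert np.allclose(aggregate_adaptive(keys, vals, 16),
                   aggregate_scalar(keys, vals, 16))
```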
{"title":"Future Vector Microprocessor Extensions for Data Aggregations","authors":"Timothy Hayes, Oscar Palomar, O. Unsal, A. Cristal, M. Valero","doi":"10.1145/3007787.3001182","DOIUrl":"https://doi.org/10.1145/3007787.3001182","url":null,"abstract":"As the rate of annual data generation grows exponentially, there is a demand to aggregate and summarise vast amounts of information quickly. In the past, frequency scaling was relied upon to push application throughput. Today, Dennard scaling has ceased and further performance must come from exploiting parallelism. Single instruction-multiple data (SIMD) instruction sets offer a highly efficient and scalable way of exploiting data-level parallelism (DLP). While microprocessors originally offered very simple SIMD support targeted at multimedia applications, these extensions have been growing both in width and functionality. Observing this trend, we use a simulation framework to model future SIMD support and then propose and evaluate five different ways of vectorising data aggregation. We find that although data aggregation is abundant in DLP, it is often too irregular to be expressed efficiently using typical SIMD instructions. Based on this observation, we propose a set of novel algorithms and SIMD instructions to better capture this irregular DLP. Furthermore, we discover that the best algorithm is highly dependent on the characteristics of the input. Our proposed solution can dynamically choose the optimal algorithm in the majority of cases and achieves speedups between 2.7x and 7.6x over a scalar baseline.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"167 1","pages":"418-430"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83530595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
Dynamo: Facebook's Data Center-Wide Power Management System
Qiang Wu, Qingyuan Deng, L. Ganesh, Chang-Hong Hsu, Yun Jin, Sanjeev Kumar, Bin Li, Justin Meza, YeeJiun Song
Data center power is a scarce resource that often goes underutilized due to conservative planning. This is because the penalty for overloading the data center power delivery hierarchy and tripping a circuit breaker is very high, potentially causing long service outages. Recently, dynamic server power capping, which limits the amount of power consumed by a server, has been proposed and studied as a way to reduce this penalty, enabling more aggressive utilization of provisioned data center power. However, no real at-scale solution for data center-wide power monitoring and control has been presented in the literature. In this paper, we describe Dynamo -- a data center-wide power management system that monitors the entire power hierarchy and makes coordinated control decisions to safely and efficiently use provisioned data center power. Dynamo has been developed and deployed across all of Facebook's data centers for the past three years. Our key insight is that in real-world data centers, different power and performance constraints at different levels in the power hierarchy necessitate coordinated data center-wide power management. We make three main contributions. First, to understand the design space of Dynamo, we provide a characterization of power variation in data centers running a diverse set of modern workloads. This characterization uses fine-grained power samples from tens of thousands of servers and spanning a period of over six months. Second, we present the detailed design of Dynamo. Our design addresses several key issues not addressed by previous simulation-based studies. Third, the proposed techniques and design have been deployed and evaluated in large scale data centers serving billions of users. We present production results showing that Dynamo has prevented 18 potential power outages in the past 6 months due to unexpected power surges, that Dynamo enables optimizations leading to a 13% performance boost for a production Hadoop cluster and a nearly 40% performance increase for a search cluster, and that Dynamo has already enabled an 8% increase in the power capacity utilization of one of our data centers with more aggressive power subscription measures underway.
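To make the coordinated, hierarchy-wide capping concrete, below is a minimal sketch of one top-down budget-splitting pass over a power delivery tree. It is not Dynamo's deployed policy (which is workload-priority-aware and driven by real meter samples); the proportional split and the headroom factor are assumptions of the sketch.

```python
class PowerNode:
    """One node of the power delivery hierarchy (suite, MSB, rack, or server)."""
    def __init__(self, name, limit_w, measured_w=0.0, children=None):
        self.name, self.limit_w = name, limit_w
        self.measured_w = measured_w        # meter sample (leaf servers only)
        self.children = children or []      # empty list -> leaf server
        self.cap_w = float("inf")           # active server power cap, if any

    def draw_w(self):
        if not self.children:
            return min(self.measured_w, self.cap_w)
        return sum(c.draw_w() for c in self.children)

def enforce(node, headroom=0.95):
    """If a level nears its breaker limit, split the budget among children
    in proportion to their current draw, then recurse down the tree."""
    budget = node.limit_w * headroom
    draw = node.draw_w()
    if draw > budget:
        for c in node.children:
            share = budget * c.draw_w() / draw
            if c.children:
                c.limit_w = min(c.limit_w, share)   # tighten an inner budget
            else:
                c.cap_w = min(c.cap_w, share)       # cap the server itself
    for c in node.children:
        enforce(c, headroom)

# A 10 kW rack feeding three servers, one of which is surging:
rack = PowerNode("rack", 10_000, children=[
    PowerNode("s1", 4_000, measured_w=4_800),
    PowerNode("s2", 4_000, measured_w=3_600),
    PowerNode("s3", 4_000, measured_w=2_400),
])
enforce(rack)
print([round(c.cap_w) for c in rack.children])      # proportional caps applied
```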
{"title":"Dynamo: Facebook's Data Center-Wide Power Management System","authors":"Qiang Wu, Qingyuan Deng, L. Ganesh, Chang-Hong Hsu, Yun Jin, Sanjeev Kumar, Bin Li, Justin Meza, YeeJiun Song","doi":"10.1145/3007787.3001187","DOIUrl":"https://doi.org/10.1145/3007787.3001187","url":null,"abstract":"Data center power is a scarce resource that often goes underutilized due to conservative planning. This is because the penalty for overloading the data center power delivery hierarchy and tripping a circuit breaker is very high, potentially causing long service outages. Recently, dynamic server power capping, which limits the amount of power consumed by a server, has been proposed and studied as a way to reduce this penalty, enabling more aggressive utilization of provisioned data center power. However, no real at-scale solution for data center-wide power monitoring and control has been presented in the literature. In this paper, we describe Dynamo -- a data center-wide power management system that monitors the entire power hierarchy and makes coordinated control decisions to safely and efficiently use provisioned data center power. Dynamo has been developed and deployed across all of Facebook's data centers for the past three years. Our key insight is that in real-world data centers, different power and performance constraints at different levels in the power hierarchy necessitate coordinated data center-wide power management. We make three main contributions. First, to understand the design space of Dynamo, we provide a characterization of power variation in data centers running a diverse set of modern workloads. This characterization uses fine-grained power samples from tens of thousands of servers and spanning a period of over six months. Second, we present the detailed design of Dynamo. Our design addresses several key issues not addressed by previous simulation-based studies. Third, the proposed techniques and design have been deployed and evaluated in large scale data centers serving billions of users. We present production results showing that Dynamo has prevented 18 potential power outages in the past 6 months due to unexpected power surges, that Dynamo enables optimizations leading to a 13% performance boost for a production Hadoop cluster and a nearly 40% performance increase for a search cluster, and that Dynamo has already enabled an 8% increase in the power capacity utilization of one of our data centers with more aggressive power subscription measures underway.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"50 1","pages":"469-480"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88506879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 118
Using Multiple Input, Multiple Output Formal Control to Maximize Resource Efficiency in Architectures
Raghavendra Pradyumna Pothukuchi, Amin Ansari, P. Voulgaris, J. Torrellas
As processors seek more resource efficiency, they increasingly need to target multiple goals at the same time, such as a level of performance, power consumption, and average utilization. Robust control solutions cannot come from heuristic-based controllers or even from formal approaches that combine multiple single-parameter controllers. Such controllers may end up working against each other. What is needed are control-theoretical MIMO (multiple input, multiple output) controllers, which actuate on multiple inputs and control multiple outputs in a coordinated manner. In this paper, we use MIMO control-theory techniques to develop controllers that dynamically tune architectural parameters in processors. To our knowledge, this is the first work in this area. We discuss three ways in which a MIMO controller can be used. We develop an example MIMO controller and show that it is substantially more effective than controllers based on heuristics or built by combining single-parameter formal controllers. The general approach discussed here is likely to become increasingly relevant as future processors become more resource-constrained and adaptive.
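A minimal sketch of what separates a MIMO controller from a pair of independent single-parameter loops: one gain matrix maps every output error to a coordinated update of every input. The gain values, knobs, and half-step integral gain below are invented for illustration; the paper derives its controllers formally through system identification.

```python
import numpy as np

# Identified steady-state gain matrix G: effect of each input (cache ways,
# frequency step) on each output (performance, power).  Values are made up.
G = np.array([[0.8, 1.5],    # d perf  / d ways,  d perf  / d freq
              [0.1, 2.0]])   # d power / d ways,  d power / d freq

K = 0.5 * np.linalg.inv(G)   # integral gain: close half the error per step

def mimo_step(u, y, r):
    """One coordinated update of all knobs from all tracking errors.
    Independent SISO loops would use only G's diagonal and could fight
    each other; the off-diagonal terms are what coordinate them."""
    return u + K @ (r - y)

u = np.array([8.0, 10.0])    # current settings: cache ways, frequency step
y = np.array([1.0, 50.0])    # measurements: normalised perf, watts
r = np.array([1.2, 45.0])    # targets
u = mimo_step(u, y, r)       # both knobs move together toward both targets
```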
{"title":"Using Multiple Input, Multiple Output Formal Control to Maximize Resource Efficiency in Architectures","authors":"Raghavendra Pradyumna Pothukuchi, Amin Ansari, P. Voulgaris, J. Torrellas","doi":"10.1145/3007787.3001207","DOIUrl":"https://doi.org/10.1145/3007787.3001207","url":null,"abstract":"As processors seek more resource efficiency, they increasingly need to target multiple goals at the same time, such as a level of performance, power consumption, and average utilization. Robust control solutions cannot come from heuristic-based controllers or even from formal approaches that combine multiple single-parameter controllers. Such controllers may end-up working against each other. What is needed is control-theoretical MIMO (multiple input, multiple output) controllers, which actuate on multiple inputs and control multiple outputs in a coordinated manner. In this paper, we use MIMO control-theory techniques to develop controllers to dynamically tune architectural parameters in processors. To our knowledge, this is the first work in this area. We discuss three ways in which a MIMO controller can be used. We develop an example of MIMO controller and show that it is substantially more effective than controllers based on heuristics or built by combining single-parameter formal controllers. The general approach discussed here is likely to be increasingly relevant as future processors become more resource-constrained and adaptive.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"119 1","pages":"658-670"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77430319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 58
APRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs
Yunho Oh, Keunsoo Kim, M. Yoon, Jong Hyun Park, Yongjun Park, W. Ro, M. Annavaram
Long memory latency and limited throughput become performance bottlenecks of GPGPU applications. The latency spans hundreds of cycles, which is difficult to hide by simply interleaving the execution of tens of warps. While the cache hierarchy helps to reduce memory system pressure, massive Thread-Level Parallelism (TLP) often causes excessive cache contention. This paper proposes Adaptive PREfetching and Scheduling (APRES) to improve GPU cache efficiency. APRES relies on the following observations. First, certain static load instructions tend to generate memory addresses having very high locality. Second, even when a load has no locality, its access addresses can still show a highly strided access pattern. Third, the locality behavior tends to be consistent regardless of warp ID. APRES schedules warps so that as many cache hits as possible are generated before any cache miss. This minimizes cache thrashing when many warps are contending for a cache line. Realizing this, however, requires predicting which warps will hit the cache in the near future. Without directly predicting future cache hits/misses for each warp, APRES creates a group of warps that will execute the same load instruction in the near future. Based on the third observation, we expect the locality behavior to be consistent over all warps in the group. If the first executed warp in the group hits the cache, the load is considered a high-locality type, and APRES prioritizes all warps in the group. Group prioritization leads to consecutive cache hits, because the grouped warps are likely to access the same cache line. If the first warp misses the cache, the load is considered a strided type, and APRES generates prefetch requests for the other warps in the group. After that, APRES prioritizes the prefetch-targeted warps so that their demand requests are merged into the Miss Status Holding Registers (MSHRs) or served from the prefetched lines. On memory-intensive applications, APRES achieves a 31.7% performance improvement over the baseline GPU and an additional 7.2% speedup over the best combination of existing warp scheduling and prefetching methods.
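A toy model of the scheduling side of this idea appears below, assuming a flat set of resident cache lines and a fixed 128-byte inter-warp stride; the real design classifies loads per static PC in hardware and merges demand requests in the MSHRs.

```python
from collections import defaultdict

class ApresScheduler:
    """Toy APRES: warps that will execute the same static load form a group;
    the first warp's cache outcome classifies the load for the whole group."""
    def __init__(self, cache, stride_bytes=128):
        self.cache = cache                  # set of resident line addresses
        self.stride = stride_bytes
        self.boost = defaultdict(int)       # warp id -> scheduling priority
        self.prefetch_queue = []

    def on_first_load(self, warp_id, addr, group):
        line = addr // 128 * 128
        if line in self.cache:              # high-locality load: run the whole
            for w in group:                 # group back-to-back, before the
                self.boost[w] += 1          # line can be evicted
        else:                               # strided load: cover the group's
            for i, w in enumerate(group):   # future misses with prefetches
                if w != warp_id:
                    self.prefetch_queue.append(line + i * self.stride)
                    self.boost[w] += 1      # then drain them as lines arrive

    def next_warp(self, ready_warps):
        return max(ready_warps, key=lambda w: self.boost[w])

sched = ApresScheduler(cache={0x1000})
sched.on_first_load(warp_id=0, addr=0x1008, group=[0, 1, 2])   # hit path
print(sched.next_warp([0, 1, 2, 3]), sched.prefetch_queue)
```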
{"title":"APRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs","authors":"Yunho Oh, Keunsoo Kim, M. Yoon, Jong Hyun Park, Yongjun Park, W. Ro, M. Annavaram","doi":"10.1145/3007787.3001158","DOIUrl":"https://doi.org/10.1145/3007787.3001158","url":null,"abstract":"Long memory latency and limited throughput become performance bottlenecks of GPGPU applications. The latency takes hundreds of cycles which is difficult to be hidden by simply interleaving tens of warp execution. While cache hierarchy helps to reduce memory system pressure, massive Thread-Level Parallelism (TLP) often causes excessive cache contention. This paper proposes Adaptive PREfetching and Scheduling (APRES) to improve GPU cache efficiency. APRES relies on the following observations. First, certain static load instructions tend to generate memory addresses having very high locality. Second, although loads have no locality, the access addresses still can show highly strided access pattern. Third, the locality behavior tends to be consistent regardless of warp ID. APRES schedules warps so that as many cache hits generated as possible before any cache misses generated. This is to minimize cache thrashing when many warps are contending for a cache line. However, to realize this operation, it is required to predict which warp will hit the cache in the near future. Without directly predicting future cache hit/miss for each warp, APRES creates a group of warps that will execute the same load instruction in the near future. Based on the third observation, we expect the locality behavior is consistent over all warps in the group. If the first executed warp in the group hits the cache, then the load is considered as a high locality type, and APRES prioritizes all warps in the group. Group prioritization leads to consecutive cache hits, because the grouped warps are likely to access the same cache line. If the first warp missed the cache, then the load is considered as a strided type, and APRES generates prefetch requests for the other warps in the group. After that, APRES prioritizes prefetch targeted warps so that the demand requests are merged to Miss Status Holding Register (MSHR) or prefetched lines can be accessed. On memory-intensive applications, APRES achieves 31.7% performance improvement compared to the baseline GPU and 7.2% additional speedup compared to the best combination of existing warp scheduling and prefetching methods.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"40 1","pages":"191-203"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81612637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 29
Power Attack Defense: Securing Battery-Backed Data Centers
Chao Li, Zhenhua Wang, Xiaofeng Hou, Hao-peng Chen, Xiaoyao Liang, M. Guo
Battery systems are crucial components for mission-critical data centers. Without secure energy backup, existing under-provisioned data centers are largely unguarded targets for cyber criminals. Particularly for today's scale-out servers, power oversubscription unavoidably taxes a data center's backup energy resources, leaving very little room for dealing with emergencies. In addition, the emerging trend towards deploying distributed energy storage architectures causes the associated energy backup of each rack to shrink, making servers vulnerable to power anomalies. As a result, an attacker can generate power peaks to easily crash or disrupt a power-constrained system. This study aims at securing data centers from malicious loads that seek to drain their precious energy storage and overload server racks without prior detection. We term such a load a Power Virus (PV), demonstrate its basic two-phase attack model, and characterize its behavior on real systems. The PV can learn the victim rack's battery characteristics by disguising itself as a benign load. Once it gains enough information, the PV can be mutated to generate hidden power spikes that have a high chance of overloading the system. To defend against the PV, we propose power attack defense (PAD), a novel energy management patch built on lightweight software and hardware mechanisms. PAD not only increases the attack cost considerably by hiding vulnerable racks from visible spikes, it also strengthens the last line of defense against hidden spikes. Using Google cluster traces, we show that PAD can effectively raise the bar for a successful power attack: compared to prior art, it increases data center survival time by 1.6~11X and provides a better performance guarantee. It enables modern data centers to safely exploit the benefits that power oversubscription may provide, with minimal cost overhead.
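The "last line of defense" can be sketched as a guard that tracks a rack's battery reserve while it shaves power peaks, and withdraws the shaving (forcing the rack down to its breaker limit) once the reserve could no longer absorb another hidden spike. The energy model, thresholds, and recharge rate below are assumptions for illustration, not PAD's deployed policy.

```python
class PadGuard:
    """Per-rack battery guard in the spirit of PAD's hidden-spike defense."""
    def __init__(self, breaker_w, battery_wh, reserve_frac=0.3):
        self.breaker_w = breaker_w
        self.battery_wh = battery_wh
        self.charge_wh = battery_wh         # start fully charged
        self.reserve_frac = reserve_frac    # minimum reserve to keep shaving

    def step(self, demand_w, dt_h=1 / 3600):
        """One control tick: return the power the rack may actually draw."""
        can_shave = self.charge_wh > self.reserve_frac * self.battery_wh
        granted = demand_w if can_shave else min(demand_w, self.breaker_w)
        surplus = granted - self.breaker_w
        if surplus > 0:                     # peak shaving drains the battery
            self.charge_wh -= surplus * dt_h
        else:                               # idle headroom trickle-charges it
            self.charge_wh = min(self.battery_wh,
                                 self.charge_wh - surplus * dt_h * 0.1)
        return granted

guard = PadGuard(breaker_w=8_000, battery_wh=200)
for t in range(7200):                       # a sustained disguised surge...
    drawn = guard.step(demand_w=9_500)
print(round(drawn))                         # ...eventually clamped to 8000 W
```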
{"title":"Power Attack Defense: Securing Battery-Backed Data Centers","authors":"Chao Li, Zhenhua Wang, Xiaofeng Hou, Hao-peng Chen, Xiaoyao Liang, M. Guo","doi":"10.1145/3007787.3001189","DOIUrl":"https://doi.org/10.1145/3007787.3001189","url":null,"abstract":"Battery systems are crucial components for mission-critical data centers. Without secure energy backup, existing under-provisioned data centers are largely unguarded targets for cyber criminals. Particularly for today's scale-out servers, power oversubscription unavoidably taxes a data center's backup energy resources, leaving very little room for dealing with emergency. Besides, the emerging trend towards deploying distributed energy storage architecture causes the associated energy backup of each rack to shrink, making servers vulnerable to power anomalies. As a result, an attacker can generate power peaks to easily crash or disrupt a power-constrained system. This study aims at securing data centers from malicious loads that seek to drain their precious energy storage and overload server racks without prior detection. We term such load as Power Virus (PV) and demonstrate its basic two-phase attacking model and characterize its behaviors on real systems. The PV can learn the victim rack's battery characteristics by disguising as benign loads. Once gaining enough information, the PV can be mutated to generate hidden power spikes that have a high chance to overload the system. To defend against PV, we propose power attack defense (PAD), a novel energy management patch built on lightweight software and hardware mechanisms. PAD not only increases the attacking cost considerably by hiding vulnerable racks from visible spikes, it also strengthens the last line of defense against hidden spikes. Using Google cluster traces we show that PAD can effectively raise the bar of a successful power attack: compared to prior arts, it increases the data center survival time by 1.6~11X and provides better performance guarantee. It enables modern data centers to safely exploit the benefits that power oversubscription may provide, with the slightest cost overhead.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"12 1","pages":"493-505"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75251239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 44
Boosting Access Parallelism to PCM-Based Main Memory
M. Arjomand, M. Kandemir, A. Sivasubramaniam, C. Das
Despite its promise as a DRAM main memory replacement, Phase Change Memory (PCM) has high write latencies, which can be a serious detriment to its widespread adoption. Apart from slowing down a write request, the consequent high latency can also keep other chips of the same rank that are not involved in the write idle for long periods. Several practical considerations make it difficult to allow subsequent reads and/or writes to be served concurrently from the same chips during the long-latency write. This paper proposes and evaluates several novel mechanisms - reconstructing data from error-correction bits instead of waiting for currently busy chips to serve a read, rotating word mappings across the chips of a PCM rank, and rotating the mapping of error detection/correction bits across these chips - to overlap several reads with an ongoing write (RoW) and even a write with an ongoing write (WoW). The paper also presents the micro-architectural enhancements needed to implement these mechanisms without significantly changing current interfaces. The resulting PCM access parallelism (PCMap) system incorporating these enhancements boosts the intra-rank-level parallelism during such writes from a very low baseline value of 2.4 to average and maximum values of 4.5 and 7.4, respectively (out of a maximum of 8.0), across a wide spectrum of both multiprogrammed and multithreaded workloads. This boost in parallelism results in average IPC improvements of 15.6% and 16.7% for the multiprogrammed and multithreaded workloads, respectively.
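The read-under-write idea can be illustrated with a RAID-style XOR reconstruction: when one chip of the rank is stalled by a long write, its contribution to a read is rebuilt from the remaining chips plus redundancy. The paper actually reuses the rank's existing error-correction bits and rotates the mappings across chips; plain parity over eight data chips is a simplification for this sketch.

```python
def make_parity(chip_bytes):
    """Redundancy chip: XOR of the data chips' bytes for one word."""
    p = 0
    for b in chip_bytes:
        p ^= b
    return p

def read_under_write(chip_bytes, parity, busy_chip):
    """Serve a read while `busy_chip` is stalled on a long PCM write:
    rebuild its byte from the seven idle chips plus the parity chip."""
    rebuilt = parity
    for i, b in enumerate(chip_bytes):
        if i != busy_chip:
            rebuilt ^= b
    word = list(chip_bytes)
    word[busy_chip] = rebuilt
    return word

data = [0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88]    # one byte per chip
parity = make_parity(data)
assert read_under_write(data, parity, busy_chip=3) == data  # read completes
```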
{"title":"Boosting Access Parallelism to PCM-Based Main Memory","authors":"M. Arjomand, M. Kandemir, A. Sivasubramaniam, C. Das","doi":"10.1145/3007787.3001211","DOIUrl":"https://doi.org/10.1145/3007787.3001211","url":null,"abstract":"Despite its promise as a DRAM main memory replacement, Phase Change Memory (PCM) has high write latencies which can be a serious detriment to its widespread adoption. Apart from slowing down a write request, the consequent high latency can also keep other chips of the same rank, that are not involved in this write, idle for long times. There are several practical considerations that make it difficult to allow subsequent reads and/or writes to be served concurrently from the same chips during the long latency write. This paper proposes and evaluates several novel mechanisms - re-constructing data from error correction bits instead of waiting for chips currently busy to serve a read, rotating word mappings across chips of a PCM rank, and rotating the mapping of error detection/correction bits across these chips - to overlap several reads with an ongoing write (RoW) and even a write with an ongoing write (WoW). The paper also presents the necessary micro-architectural enhancements needed to implement these mechanisms, without significantly changing the current interfaces. The resulting PCM access parallelism (PCMap) system incorporating these enhancements, boosts the intra-rank-level parallelism during such writes from a very low baseline value of 2.4 to an average and maximum values of 4.5 and 7.4, respectively (out of a maximum of 8.0), across a wide spectrum of both multiprogrammed and multithreaded workloads. This boost in parallelism results in an average IPC improvement of 15.6% and 16.7% for the multi-programmed and multi-threaded workloads, respectively.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"31 1","pages":"695-706"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74477310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 37
Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing
Jorge Albericio, Patrick Judd, Tayler H. Hetherington, Tor M. Aamodt, Natalie D. Enright Jerger, Andreas Moshovos
This work observes that a large fraction of the computations performed by Deep Neural Networks (DNNs) are intrinsically ineffectual as they involve a multiplication where one of the inputs is zero. This observation motivates Cnvlutin (CNV), a value-based approach to hardware acceleration that eliminates most of these ineffectual operations, improving performance and energy over a state-of-the-art accelerator with no accuracy loss. CNV uses hierarchical data-parallel units, allowing groups of lanes to proceed mostly independently, enabling them to skip over the ineffectual computations. A co-designed data storage format encodes the computation elimination decisions, taking them off the critical path while avoiding control divergence in the data-parallel units. Combined, the units and the data storage format result in a data-parallel architecture that maintains wide, aligned accesses to its memory hierarchy and that keeps its data lanes busy. By loosening the ineffectual computation identification criterion, CNV enables further performance and energy efficiency improvements, and more so if a loss in accuracy is acceptable. Experimental measurements over a set of state-of-the-art DNNs for image classification show that CNV improves performance over a state-of-the-art accelerator from 1.24× to 1.55×, and by 1.37× on average, without any loss in accuracy by removing zero-valued operand multiplications alone. While CNV incurs an area overhead of 4.49%, it improves overall EDP (Energy Delay Product) and ED2P (Energy Delay Squared Product) on average by 1.47× and 2.01×, respectively. The average performance improvement increases to 1.52× without any loss in accuracy with a broader ineffectual identification policy. Further improvements are demonstrated when a loss in accuracy is acceptable.
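The underlying arithmetic observation is easy to demonstrate in software: with a zero-free (value, offset) encoding, a dot product touches only the nonzero activations. This scalar sketch shows the skipping idea only, not the accelerator's data-parallel lanes or its exact storage format.

```python
def compress(activations):
    """Zero-free encoding: keep only nonzero values with their offsets,
    loosely modelling the co-designed storage format."""
    return [(v, i) for i, v in enumerate(activations) if v != 0.0]

def dot_skip_zeros(activations, weights):
    """Multiply-accumulate over nonzero activations only; every skipped
    pair is a multiplication a dense datapath would have wasted."""
    return sum(v * weights[i] for v, i in compress(activations))

acts = [0.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.5]   # ReLU outputs: mostly zero
wts  = [0.3, 0.1, 0.9, 0.4, 0.2, 0.8, 0.7, 1.0]
print(dot_skip_zeros(acts, wts))                   # 3 MACs instead of 8
```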
{"title":"Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing","authors":"Jorge Albericio, Patrick Judd, Tayler H. Hetherington, Tor M. Aamodt, Natalie D. Enright Jerger, Andreas Moshovos","doi":"10.1145/3007787.3001138","DOIUrl":"https://doi.org/10.1145/3007787.3001138","url":null,"abstract":"This work observes that a large fraction of the computations performed by Deep Neural Networks (DNNs) are intrinsically ineffectual as they involve a multiplication where one of the inputs is zero. This observation motivates Cnvolutin (CNV), a value-based approach to hardware acceleration that eliminates most of these ineffectual operations, improving performance and energy over a state-of-the-art accelerator with no accuracy loss. CNV uses hierarchical data-parallel units, allowing groups of lanes to proceed mostly independently enabling them to skip over the ineffectual computations. A co-designed data storage format encodes the computation elimination decisions taking them off the critical path while avoiding control divergence in the data parallel units. Combined, the units and the data storage format result in a data-parallel architecture that maintains wide, aligned accesses to its memory hierarchy and that keeps its data lanes busy. By loosening the ineffectual computation identification criterion, CNV enables further performance and energy efficiency improvements, and more so if a loss in accuracy is acceptable. Experimental measurements over a set of state-of-the-art DNNs for image classification show that CNV improves performance over a state-of-the-art accelerator from 1.24× to 1.55× and by 1.37× on average without any loss in accuracy by removing zero-valued operand multiplications alone. While CNV incurs an area overhead of 4.49%, it improves overall EDP (Energy Delay Product) and ED2P (Energy Delay Squared Product) on average by 1.47× and 2.01×, respectively. The average performance improvements increase to 1.52× without any loss in accuracy with a broader ineffectual identification policy. Further improvements are demonstrated with a loss in accuracy.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"50 1","pages":"1-13"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81768090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 629
RedEye: Analog ConvNet Image Sensor Architecture for Continuous Mobile Vision
R. Likamwa, Yunhui Hou, Yuan Gao, M. Polansky, Lin Zhong
Continuous mobile vision is limited by the inability to efficiently capture image frames and process vision features. This is largely due to the energy burden of analog readout circuitry, data traffic, and intensive computation. To promote efficiency, we shift early vision processing into the analog domain. This results in RedEye, an analog convolutional image sensor that performs layers of a convolutional neural network in the analog domain before quantization. We design RedEye to mitigate analog design complexity, using a modular column-parallel design to promote physical design reuse and algorithmic cyclic reuse. RedEye uses programmable mechanisms to admit noise for tunable energy reduction. Compared to conventional systems, RedEye reports an 85% reduction in sensor energy, 73% reduction in cloudlet-based system energy, and a 45% reduction in computation-based system energy.
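As a back-of-the-envelope model of the energy/noise knob described here, the sketch below computes one convolution term with additive noise standing in for cheaper analog computation, and quantises once at readout. The Gaussian noise model and the 8-bit readout are assumptions of the sketch, not RedEye's circuit design.

```python
import numpy as np

rng = np.random.default_rng(0)

def analog_conv_term(patch, kernel, noise_sigma, bits=8):
    """One convolution term 'in the analog domain': noisy accumulate,
    then a single quantisation step at readout (the lone ADC crossing)."""
    acc = float(np.sum(patch * kernel))       # charge-domain accumulate
    acc += rng.normal(0.0, noise_sigma)       # admitted analog noise
    scale = 2 ** (bits - 1) - 1
    return float(np.clip(round(acc * scale), -scale, scale)) / scale

patch  = np.ones((3, 3)) / 9.0
kernel = np.full((3, 3), 0.5)
print(analog_conv_term(patch, kernel, noise_sigma=0.0))    # ~0.5, exact
print(analog_conv_term(patch, kernel, noise_sigma=0.05))   # cheaper, noisier
```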
{"title":"RedEye: Analog ConvNet Image Sensor Architecture for Continuous Mobile Vision","authors":"R. Likamwa, Yunhui Hou, Yuan Gao, M. Polansky, Lin Zhong","doi":"10.1145/3007787.3001164","DOIUrl":"https://doi.org/10.1145/3007787.3001164","url":null,"abstract":"Continuous mobile vision is limited by the inability to efficiently capture image frames and process vision features. This is largely due to the energy burden of analog readout circuitry, data traffic, and intensive computation. To promote efficiency, we shift early vision processing into the analog domain. This results in RedEye, an analog convolutional image sensor that performs layers of a convolutional neural network in the analog domain before quantization. We design RedEye to mitigate analog design complexity, using a modular column-parallel design to promote physical design reuse and algorithmic cyclic reuse. RedEye uses programmable mechanisms to admit noise for tunable energy reduction. Compared to conventional systems, RedEye reports an 85% reduction in sensor energy, 73% reduction in cloudlet-based system energy, and a 45% reduction in computation-based system energy.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"43 1","pages":"255-266"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84420171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 186
Agile Paging: Exceeding the Best of Nested and Shadow Paging
Jayneel Gandhi, M. Hill, M. Swift
Virtualization provides benefits for many workloads, but the overheads of virtualizing memory are not universally low. The cost comes from managing two levels of address translation - one in the guest virtual machine (VM) and the other in the host virtual machine monitor (VMM) - with either nested or shadow paging. Nested paging directly performs a two-level page walk that makes TLB misses slower than unvirtualized native, but enables fast page tables changes. Alternatively, shadow paging restores native TLB miss speeds, but requires costly VMM intervention on page table updates. This paper proposes agile paging that combines both techniques and exceeds the best of both. A virtualized page walk starts with shadow paging and optionally switches in the same page walk to nested paging where frequent page table updates would cause costly VMM interventions. Agile paging enables most TLB misses to be handled as fast as native while most page table changes avoid VMM intervention. It requires modest changes to hardware (e.g., demark when to switch) and VMM policies (e.g., predict good switching opportunities). We emulate the proposed hardware and prototype the software in Linux with KVM on x86-64. Agile paging performs more than 12% better than the best of the two techniques and comes within 4% of native execution for all workloads.
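A sketch of the mid-walk switch, with flat dicts standing in for the multi-level radix page tables: stable regions translate through the shadow table in one native-speed step, while pages the VMM has marked volatile fall through to the two-level nested walk. The volatile-page set is a stand-in for the paper's switching policy.

```python
def translate(gva, shadow, guest_pt, host_pt, volatile_pages):
    """Agile walk: shadow path for stable pages, nested path for volatile ones."""
    page, offset = gva & ~0xFFF, gva & 0xFFF
    if page not in volatile_pages and page in shadow:
        return shadow[page] | offset      # shadow: one native-speed lookup
    gpa_page = guest_pt[page]             # nested level 1: gVA page -> gPA page
    hpa_page = host_pt[gpa_page]          # nested level 2: gPA page -> hPA page
    return hpa_page | offset              # slower, but guest updates skip the VMM

shadow   = {0x1000: 0x9000}               # pre-merged gVA -> hPA mappings
guest_pt = {0x1000: 0x5000, 0x2000: 0x6000}
host_pt  = {0x5000: 0x9000, 0x6000: 0xA000}
assert translate(0x1234, shadow, guest_pt, host_pt, set()) == 0x9234
assert translate(0x2ABC, shadow, guest_pt, host_pt, {0x2000}) == 0xAABC
```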
{"title":"Agile Paging: Exceeding the Best of Nested and Shadow Paging","authors":"Jayneel Gandhi, M. Hill, M. Swift","doi":"10.1145/3007787.3001212","DOIUrl":"https://doi.org/10.1145/3007787.3001212","url":null,"abstract":"Virtualization provides benefits for many workloads, but the overheads of virtualizing memory are not universally low. The cost comes from managing two levels of address translation - one in the guest virtual machine (VM) and the other in the host virtual machine monitor (VMM) - with either nested or shadow paging. Nested paging directly performs a two-level page walk that makes TLB misses slower than unvirtualized native, but enables fast page tables changes. Alternatively, shadow paging restores native TLB miss speeds, but requires costly VMM intervention on page table updates. This paper proposes agile paging that combines both techniques and exceeds the best of both. A virtualized page walk starts with shadow paging and optionally switches in the same page walk to nested paging where frequent page table updates would cause costly VMM interventions. Agile paging enables most TLB misses to be handled as fast as native while most page table changes avoid VMM intervention. It requires modest changes to hardware (e.g., demark when to switch) and VMM policies (e.g., predict good switching opportunities). We emulate the proposed hardware and prototype the software in Linux with KVM on x86-64. Agile paging performs more than 12% better than the best of the two techniques and comes within 4% of native execution for all workloads.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"69 4 1","pages":"707-718"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83571157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 61
CASH: Supporting IaaS Customers with a Sub-core Configurable Architecture
Yanqi Zhou, H. Hoffmann, D. Wentzlaff
Infrastructure as a Service (IaaS) Clouds have grown increasingly important. Recent architecture designs support IaaS providers through fine-grain configurability, allowing providers to orchestrate low-level resource usage. Little work, however, has been devoted to supporting IaaS customers who must determine how to use such fine-grain configurable resources to meet quality-of-service (QoS) requirements while minimizing cost. This is a difficult problem because the multiplicity of configurations creates a non-convex optimization space. In addition, this optimization space may change as customer applications enter and exit distinct processing phases. In this paper, we overcome these issues by proposing CASH: a fine-grain configurable architecture co-designed with a cost-optimizing runtime system. The hardware architecture enables configurability at the granularity of individual ALUs and L2 cache banks and provides unique interfaces to support low-overhead, dynamic configuration and monitoring. The runtime uses a combination of control theory and machine learning to configure the architecture such that QoS requirements are met and cost is minimized. Our results demonstrate that the combination of fine-grain configurability and non-convex optimization provides tremendous cost savings (70% savings) compared to coarse-grain heterogeneity and heuristic optimization. In addition, the system is able to customize configurations to particular applications, respond to application phases, and provide near optimal cost for QoS targets.
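Because the configuration space is non-convex, a simple but robust customer-side strategy is to evaluate a learned cost/performance model over the whole (small) sub-core space rather than hill-climb it. The toy models and prices below are assumptions standing in for CASH's identified models.

```python
from itertools import product

def cheapest_config(perf_model, qos_target, alu_options, bank_options, price):
    """Exhaustively pick the cheapest (ALUs, cache banks) point that still
    meets the QoS target; greedy search could stall on a non-convex space."""
    best = None
    for alus, banks in product(alu_options, bank_options):
        if perf_model(alus, banks) >= qos_target:
            cost = price(alus, banks)
            if best is None or cost < best[0]:
                best = (cost, alus, banks)
    return best

# Toy model: performance saturates in each dimension and interacts across them.
perf  = lambda a, b: min(a * 1.0, b * 1.5) + 0.1 * a * b
price = lambda a, b: 0.05 * a + 0.08 * b          # $/hour per unit, invented
print(cheapest_config(perf, qos_target=6.0,
                      alu_options=range(1, 9), bank_options=range(1, 9),
                      price=price))               # -> cheapest feasible point
```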
{"title":"CASH: Supporting IaaS Customers with a Sub-core Configurable Architecture","authors":"Yanqi Zhou, H. Hoffmann, D. Wentzlaff","doi":"10.1145/3007787.3001209","DOIUrl":"https://doi.org/10.1145/3007787.3001209","url":null,"abstract":"Infrastructure as a Service (IaaS) Clouds have grown increasingly important. Recent architecture designs support IaaS providers through fine-grain configurability, allowing providers to orchestrate low-level resource usage. Little work, however, has been devoted to supporting IaaS customers who must determine how to use such fine-grain configurable resources to meet quality-of-service (QoS) requirements while minimizing cost. This is a difficult problem because the multiplicity of configurations creates a non-convex optimization space. In addition, this optimization space may change as customer applications enter and exit distinct processing phases. In this paper, we overcome these issues by proposing CASH: a fine-grain configurable architecture co-designed with a cost-optimizing runtime system. The hardware architecture enables configurability at the granularity of individual ALUs and L2 cache banks and provides unique interfaces to support low-overhead, dynamic configuration and monitoring. The runtime uses a combination of control theory and machine learning to configure the architecture such that QoS requirements are met and cost is minimized. Our results demonstrate that the combination of fine-grain configurability and non-convex optimization provides tremendous cost savings (70% savings) compared to coarse-grain heterogeneity and heuristic optimization. In addition, the system is able to customize configurations to particular applications, respond to application phases, and provide near optimal cost for QoS targets.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"515 1","pages":"682-694"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78146311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 30