Off-chip memory traffic has been a major performance bottleneck in deep learning accelerators. While reusing on-chip data is a promising way to reduce off-chip traffic, the opportunity of reusing shortcut connection data in deep networks (e.g., residual networks) has been largely neglected. Such shortcut data accounts for nearly 40% of the total feature map data. In this paper, we propose Shortcut Mining, a novel approach that “mines” the unexploited opportunity of on-chip data reuse. We introduce the abstraction of logical buffers to address the lack of flexibility in existing buffer architectures, and then propose a sequence of procedures which, collectively, can effectively reuse both shortcut and non-shortcut feature maps. The proposed procedures are also able to reuse shortcut data across any number of intermediate layers without using additional buffer resources. Experimental results from prototyping on FPGAs show that the proposed Shortcut Mining achieves 53.3%, 58%, and 43% reductions in off-chip feature map traffic for SqueezeNet, ResNet-34, and ResNet-152, respectively, and a 1.93× increase in throughput compared with a state-of-the-art accelerator.
{"title":"Shortcut Mining: Exploiting Cross-Layer Shortcut Reuse in DCNN Accelerators","authors":"Arash AziziMazreah, Lizhong Chen","doi":"10.1109/HPCA.2019.00030","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00030","url":null,"abstract":"Off-chip memory traffic has been a major performance bottleneck in deep learning accelerators. While reusing on-chip data is a promising way to reduce off-chip traffic, the opportunity on reusing shortcut connection data in deep networks (e.g., residual networks) have been largely neglected. Those shortcut data accounts for nearly 40% of the total feature map data. In this paper, we propose Shortcut Mining, a novel approach that “mines” the unexploited opportunity of on-chip data reusing. We introduce the abstraction of logical buffers to address the lack of flexibility in existing buffer architecture, and then propose a sequence of procedures which, collectively, can effectively reuse both shortcut and non-shortcut feature maps. The proposed procedures are also able to reuse shortcut data across any number of intermediate layers without using additional buffer resources. Experiment results from prototyping on FPGAs show that, the proposed Shortcut Mining achieves 53.3%, 58%, and 43% reduction in off-chip feature map traffic for SqueezeNet, ResNet-34, and ResNet152, respectively and a 1.93X increase in throughput compared with a state-of-the-art accelerator.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"112 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122050394","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Amirhossein Mirhosseini, Akshitha Sriraman, T. Wenisch
We are entering an era of “killer microseconds” in data center applications. Killer microseconds refer to μs-scale “holes” in CPU schedules caused by stalls to access fast I/O devices or brief idle times between requests in high-throughput microservices. Whereas modern computing platforms can efficiently hide ns-scale and ms-scale stalls through micro-architectural techniques and OS context switching, they lack efficient support to hide the latency of μs-scale stalls. Simultaneous Multithreading (SMT) is an efficient way to improve core utilization and increase server performance density. Unfortunately, scaling SMT to provision enough threads to hide frequent μs-scale stalls is prohibitive, and SMT co-location can often drastically increase the tail latency of cloud microservices. In this paper, we propose Duplexity, a heterogeneous server architecture that employs aggressive multithreading to hide the latency of killer microseconds without sacrificing the Quality-of-Service (QoS) of latency-sensitive microservices. Duplexity provisions dyads (pairs) of two kinds of cores: master-cores, each of which primarily executes a single latency-critical master-thread, and lender-cores, which multiplex latency-insensitive throughput threads. When the master-thread stalls, the master-core borrows filler-threads from the lender-core, filling μs-scale utilization holes of the microservice. We propose critical mechanisms, including separate memory paths for the master-thread and filler-threads, to enable master-cores to borrow filler-threads while protecting master-threads’ state from disruption. Duplexity facilitates fast master-thread restart when stalls resolve and minimizes the microservice’s QoS violations. Our evaluation demonstrates that Duplexity achieves 1.9× higher core utilization and 2.7× lower iso-throughput 99th-percentile tail latency than an SMT-based server design, on average.
{"title":"Enhancing Server Efficiency in the Face of Killer Microseconds","authors":"Amirhossein Mirhosseini, Akshitha Sriraman, T. Wenisch","doi":"10.1109/HPCA.2019.00037","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00037","url":null,"abstract":"We are entering an era of “killer microseconds” in data center applications. Killer microseconds refer to μs-scale “holes” in CPU schedules caused by stalls to access fast I/O devices or brief idle times between requests in high throughput microservices. Whereas modern computing platforms can efficiently hide ns-scale and ms-scale stalls through micro-architectural techniques and OS context switching, they lack efficient support to hide the latency of μs-scale stalls. Simultaneous Multithreading (SMT) is an efficient way to improve core utilization and increase server performance density. Unfortunately, scaling SMT to provision enough threads to hide frequent μs-scale stalls is prohibitive and SMT co-location can often drastically increase the tail latency of cloud microservices. In this paper, we propose Duplexity, a heterogeneous server architecture that employs aggressive multithreading to hide the latency of killer microseconds, without sacrificing the Quality-of-Service (QoS) of latency-sensitive microservices. Duplexity provisions dyads (pairs) of two kinds of cores: master-cores, which each primarily executes a single latency-critical master-thread, and lender-cores, which multiplex latency-insensitive throughput threads. When the master-thread stalls, the master-core borrows filler-threads from the lender-core, filling μs-scale utilization holes of the microservice. We propose critical mechanisms, including separate memory paths for the master-thread and filler-threads, to enable master-cores to borrow filler-threads while protecting master-threads’ state from disruption. Duplexity facilitates fast master-thread restart when stalls resolve and minimizes the microservice’s QoS violation. Our evaluation demonstrates that Duplexity is able to achieve 1.9× higher core utilization and 2.7× lower iso-throughput 99th-percentile tail latency over an SMT-based server design, on average.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114508056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present the Versatile Inference Processor (VIP), a highly programmable architecture for machine learning inference. VIP consists of 128 lightweight processing engines employing a vector processing paradigm, with a simple ISA and carefully chosen microarchitecture features. It is coupled with a modern, lightly customized, 3D-stacked memory system. Through detailed execution-driven simulations backed by RTL synthesis, we show that we can achieve online, real-time vision throughput (24 fps) at low power consumption, for both full-HD depth-from-stereo using belief propagation, and VGG-16 and VGG-19 deep neural networks (batch size of 1). Our RTL synthesis of a VIP processing engine in TSMC 28 nm technology, using a commercial standard-cell library supplied by ARM, results in 18 mm² of silicon area and 3.5 W to 4.8 W of power consumption for all 128 VIP processing engines combined.
{"title":"VIP: A Versatile Inference Processor","authors":"Skand Hurkat, José F. Martínez","doi":"10.1109/HPCA.2019.00049","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00049","url":null,"abstract":"We present Versatile Inference Processor (VIP), a highly programmable architecture for machine learning inference. VIP consists of 128 lightweight processing engines employing a vector processing paradigm, with a simple ISA and carefully chosen microarchitecture features. It is coupled with a modern, lightly customized, 3D-stacked memory system. Through detailed execution-driven simulations backed by RTL synthesis, we show that we can achieve online, real-time vision throughput (24 fps), at low power consumption, for both fullHD depth-from-stereo using belief propagation, and VGG-16 and VGG-19 deep neural networks (batch size of 1). Our RTL synthesis of a VIP processing engine in TSMC 28 nm technology, using a commercial standard-cell library supplied by ARM, results in 18 mm2 of silicon area and 3.5 W to 4.8 W of power consumption for all 128 VIP processing engines combined.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130662278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
“Amoeba: An Autonomous Backup and Recovery SSD for Ransomware Attack Defense”, Donghyun Min, Donggyu Park, Jinwoo Ahn, Ryan Walker, Junghee Lee, Sungyong Park, Youngjae Kim, Sogang University and University of Texas at San Antonio
“The Architectural Implications of Cloud Microservices”, Yu Gan and Christina Delimitrou, Cornell University
“An Alternative Analytical Approach to Associative Processing”, Soroosh Khoram, Yue Zha, and Jing Li, University of Wisconsin-Madison
{"title":"The Best of IEEE Computer Architecture Letters in 2018","authors":"P. Gratz","doi":"10.1109/HPCA.2019.00060","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00060","url":null,"abstract":"“Amoeba: An Autonomous Backup and Recovery SSD for Ransomware Attack Defense”, Donghyun Min, Donggyu Park, Jinwoo Ahn, Ryan Walker, Junghee Lee, Sungyong Park, Youngjae Kim, Sogang University and University of Texas at San Antonio “The Architectural Implications of Cloud Microservices”, Yu Gan and Christina Delimitrou, Cornell University “An Alternative Analytical Approach to Associative Processing”, Soroosh Khoram, Yue Zha, and Jing Li, University of Wisconsin-Madison","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131006620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Active Timing Margin (ATM) is a technology that improves processor efficiency by reducing the pipeline timing margin with a control loop that adjusts voltage and frequency based on real-time chip environment monitoring. Although ATM has already been shown to yield substantial performance benefits, its full potential has yet to be unlocked. In this paper, we investigate how to maximize ATM’s efficiency gain with a new means of exposing the inter-core speed variation: fine-tuning the ATM control loop. We conduct our analysis and evaluation on a production-grade POWER7+ system. On the POWER7+ server platform, we fine-tune the ATM control loop by programming its Critical Path Monitors, a key component of its ATM design that measures the cores’ timing margins. With a robust stress-test procedure, we expose over 200 MHz of inherent inter-core speed differential by fine-tuning the per-core ATM control loop. Exploiting this differential, we manage to double the ATM frequency gain over the static timing margin; this is not possible using conventional means, i.e., by setting fixed <v, f> points for each core, because the core-level <v, f> setting must account for chip-wide worst-case voltage variation. To manage the significant performance heterogeneity of fine-tuned systems, we propose application scheduling and throttling to cope with the chip’s process and voltage variation. Our proposal improves application performance by more than 10% over the static margin, almost doubling the 6% improvement of the default, unmanaged ATM system. Our technique is general enough that it can be adopted by any system that employs an active timing margin control loop.
Keywords: Active timing margin; Performance; Power efficiency; Reliability; Critical path monitors
{"title":"Fine-Tuning the Active Timing Margin (ATM) Control Loop for Maximizing Multi-core Efficiency on an IBM POWER Server","authors":"Yazhou Zu, Daniel Richins, C. Lefurgy, V. Reddi","doi":"10.1109/HPCA.2019.00031","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00031","url":null,"abstract":"Active Timing Margin (ATM) is a technology that improves processor efficiency by reducing the pipeline timing margin with a control loop that adjusts voltage and frequency based on real-time chip environment monitoring. Although ATM has already been shown to yield substantial performance benefits, its full potential has yet to be unlocked. In this paper, we investigate how to maximize ATM’s efficiency gain with a new means of exposing the inter-core speed variation: finetuning the ATM control loop. We conduct our analysis and evaluation on a production-grade POWER7+ system. On the POWER7+ server platform, we fine-tune the ATM control loop by programming its Critical Path Monitors, a key component of its ATM design that measures the cores’ timing margins. With a robust stress-test procedure, we expose over 200 MHz of inherent inter-core speed differential by fine-tuning the percore ATM control loop. Exploiting this differential, we manage to double the ATM frequency gain over the static timing margin; this is not possible using conventional means, i.e. by setting fixed <v, f> points for each core, because the corelevel <v, f> must account for chip-wide worst-case voltage variation. To manage the significant performance heterogeneity of fine-tuned systems, we propose application scheduling and throttling to manage the chip’s process and voltage variation. Our proposal improves application performance by more than 10% over the static margin, almost doubling the 6% improvement of the default, unmanaged ATM system. Our technique is general enough that it can be adopted by any system that employs an active timing margin control loop. Keywords-Active timing margin, Performance, Power efficiency, Reliability, Critical path monitors","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129655605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiaowei Wang, Jiecao Yu, C. Augustine, R. Iyer, R. Das
We propose Bit Prudent In-Cache Acceleration of Deep Convolutional Neural Networks, an in-SRAM architecture for accelerating Convolutional Neural Network (CNN) inference by leveraging network redundancy and massive parallelism. The network redundancy is exploited in two ways. First, we prune and fine-tune the trained network model and develop two distinct methods, coalescing and overlapping, to run inference efficiently with sparse models. Second, we propose an architecture for network models with a reduced bit width by leveraging bit-serial computation. Our proposed architecture achieves a 17.7×/3.7× speedup over server-class CPU/GPU, and a 1.6× speedup compared to the relevant in-cache accelerator, with 2% area overhead per processor die and no loss in top-1 accuracy for AlexNet. With a relaxed accuracy limit, our tunable architecture achieves higher speedups.
Keywords: In-Memory Computing; Cache; Neural Network Pruning; Low-Precision Neural Network
{"title":"Bit Prudent In-Cache Acceleration of Deep Convolutional Neural Networks","authors":"Xiaowei Wang, Jiecao Yu, C. Augustine, R. Iyer, R. Das","doi":"10.1109/HPCA.2019.00029","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00029","url":null,"abstract":"We propose Bit Prudent In-Cache Acceleration of Deep Convolutional Neural Networks an in-SRAM architecture for accelerating Convolutional Neural Network (CNN) inference by leveraging network redundancy and massive parallelism. The network redundancy is exploited in two ways. First, we prune and fine-tune the trained network model and develop two distinct methods coalescing and overlapping to run inferences efficiently with sparse models. Second, we propose an architecture for network models with a reduced bit width by leveraging bit-serial computation. Our proposed architecture achieves a 17.7×/3.7× speedup over server class CPU/GPU, and a 1.6× speedup compared to the relevant in-cache accelerator, with 2% area overhead each processor die, and no loss on top-1 accuracy for AlexNet. With a relaxed accuracy limit, our tunable architecture achieves higher speedups. Keywords-In-Memory Computing; Cache; Neural Network Pruning; Low Precision Neural Network.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126433614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ilias Vougioukas, Nikos Nikoleris, Andreas Sandberg, S. Diestelhorst, B. Al-Hashimi, G. Merrett
Modern processors use branch prediction as an optimization to improve processor performance. Predictors have become larger and increasingly more sophisticated in order to achieve the higher accuracies needed in high-performance cores. However, branch prediction can also be a source of side-channel exploits, as one context can deliberately change the branch predictor state and alter the instruction flow of another context. Current mitigation techniques either sacrifice performance for security, or fail to guarantee isolation while retaining accuracy. Achieving both has proven to be challenging. In this work we address this by (1) introducing the notions of steady-state and transient branch predictor accuracy, and (2) showing that current predictors increase their misprediction rate by as much as 90% on average when forced to flush branch prediction state to remain secure. To solve this, (3) we introduce the branch retention buffer, a novel mechanism that partitions only the most useful branch predictor components to isolate separate contexts. Our mechanism makes thread isolation practical, as it stops the predictor from executing cold with little, if any, added area and no warm-up overheads. At the same time, our results show that, compared to the state-of-the-art, average misprediction rates are reduced by 15-20% without increasing area, leading to a 2% performance increase.
{"title":"BRB: Mitigating Branch Predictor Side-Channels.","authors":"Ilias Vougioukas, Nikos Nikoleris, Andreas Sandberg, S. Diestelhorst, B. Al-Hashimi, G. Merrett","doi":"10.1109/HPCA.2019.00058","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00058","url":null,"abstract":"Modern processors use branch prediction as an optimization to improve processor performance. Predictors have become larger and increasingly more sophisticated in order to achieve higher accuracies which are needed in high performance cores. However, branch prediction can also be a source of side channel exploits, as one context can deliberately change the branch predictor state and alter the instruction flow of another context. Current mitigation techniques either sacrifice performance for security, or fail to guarantee isolation when retaining the accuracy. Achieving both has proven to be challenging. In this work we address this by, (1) introducing the notions of steady-state and transient branch predictor accuracy, and (2) showing that current predictors increase their misprediction rate by as much as 90% on average when forced to flush branch prediction state to remain secure. To solve this, (3) we introduce the branch retention buffer, a novel mechanism that partitions only the most useful branch predictor components to isolate separate contexts. Our mechanism makes thread isolation practical, as it stops the predictor from executing cold with little if any added area and no warm-up overheads. At the same time our results show that, compared to the state-of-the-art, average misprediction rates are reduced by 15-20% without increasing area, leading to a 2% performance increase.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"137 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133653081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Y. Li, C. Lefurgy, K. Rajamani, Malcolm S. Allen-Ware, Guillermo J. Silva, D. Heimsoth, Saugata Ghose, O. Mutlu
Power management is a key component of modern data center design. Power managers must (1) ensure the cost- and energy-efficient utilization of the data center infrastructure, (2) maintain availability of the services provided by the center, and (3) address environmental concerns associated with the center’s power consumption. While several power management techniques have been proposed and deployed in production data centers, there are still many challenges to comprehensive data center power management. This is particularly true in public cloud environments, where different jobs have different priority levels, and where high availability is critical. One example of the challenges facing public cloud data centers involves power capping. As power delivery must be highly reliable and tolerate wide variation in the load drawn by the data center components, the power infrastructure (e.g., power supplies, circuit breakers, UPS) has high redundancy and overprovisioning. During normal operation (i.e., typical server power demands, and no failures in the center), the power infrastructure is significantly underutilized. Power capping is a common solution to reduce this underutilization, by allowing more servers to be added safely (i.e., without power shortfalls) to the existing power infrastructure, and throttling power consumption in the infrequent cases where the demanded power exceeds the provisioned power capacity to avoid shortfalls. However, state-of-the-art power capping solutions are (1) not directly applicable to the redundant power infrastructure used in highly-available data centers; and (2) oblivious to differing workload priorities across the entire center when power consumption needs to be throttled, which can unnecessarily slow down high-priority work. To address this need, we develop CapMaestro, a new power management architecture with three key features for public cloud data centers. First, CapMaestro is designed to work with multiple power feeds (i.e., sources), and exploits server-level power capping to independently cap the load on each feed of a server. Second, CapMaestro uses a scalable, global priority-aware power capping approach, which accounts for power capacity at each level of the power distribution hierarchy. It exploits the underutilization of commonly-employed redundant power infrastructure at each level of the hierarchy to safely accommodate a much greater number of servers. Third, CapMaestro exploits stranded power (i.e., power budgets that are not utilized) in redundant power infrastructure to boost the performance of workloads in the data center. We add CapMaestro to a real cloud data center control plane, and demonstrate the effectiveness of all three key features. Using a large-scale data center simulation, we demonstrate that CapMaestro significantly and safely increases the number of servers for existing infrastructure. We also call out other key technical challenges the industry faces in data center power management.
{"title":"A Scalable Priority-Aware Approach to Managing Data Center Server Power","authors":"Y. Li, C. Lefurgy, K. Rajamani, Malcolm S. Allen-Ware, Guillermo J. Silva, D. Heimsoth, Saugata Ghose, O. Mutlu","doi":"10.1109/HPCA.2019.00067","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00067","url":null,"abstract":"Power management is a key component of modern data center design. Power managers must (1) ensure the costand energy-efficient utilization of the data center infrastructure, (2) maintain availability of the services provided by the center, and (3) address environmental concerns associated with the center’s power consumption. While several power management techniques have been proposed and deployed in production data centers, there are still many challenges to comprehensive data center power management. This is particularly true in public cloud environments, where different jobs have different priority levels, and where high availability is critical. One example of the challenges facing public cloud data centers involves power capping. As power delivery must be highly reliable and tolerate wide variation in the load drawn by the data center components, the power infrastructure (e.g., power supplies, circuit breakers, UPS) has high redundancy and overprovisioning. During normal operation (i.e., typical server power demands, and no failures in the center), the power infrastructure is significantly underutilized. Power capping is a common solution to reduce this underutilization, by allowing more servers to be added safely (i.e., without power shortfalls) to the existing power infrastructure, and throttling power consumption in the infrequent cases where the demanded power exceeds the provisioned power capacity to avoid shortfalls. However, state-of-the-art power capping solutions are (1) not directly applicable to the redundant power infrastructure used in highly-available data centers; and (2) oblivious to differing workload priorities across the entire center when power consumption needs to be throttled, which can unnecessarily slow down high-priority work. To address this need, we develop CapMaestro, a new power management architecture with three key features for public cloud data centers. First, CapMaestro is designed to work with multiple power feeds (i.e., sources), and exploits server-level power capping to independently cap the load on each feed of a server. Second, CapMaestro uses a scalable, global priority-aware power capping approach, which accounts for power capacity at each level of the power distribution hierarchy. It exploits the underutilization of commonly-employed redundant power infrastructure at each level of the hierarchy to safely accommodate a much greater number of servers. Third, CapMaestro exploits stranded power (i.e., power budgets that are not utilized) in redundant power infrastructure to boost the performance of workloads in the data center. We add CapMaestro to a real cloud data center control plane, and demonstrate the effectiveness of all three key features. Using a large-scale data center simulation, we demonstrate that CapMaestro significantly and safely increases the number of servers for existing infrastructure. 
We also call out other key technical challenges the industry faces in data center power management.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133865997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
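The global priority-aware capping step can be illustrated with a small allocation routine for one level of the power hierarchy (the policy below, strict priority across classes with proportional throttling inside a class, is an assumption made for illustration and not necessarily CapMaestro's exact algorithm).

```python
# Illustrative budget assignment for one node of a power-distribution hierarchy:
# higher-priority children receive their demand first; a priority class that does
# not fit in the remaining budget is throttled proportionally.

def assign_budgets(node_budget_w, children):
    """children: list of (name, priority, demand_w); lower priority value = more critical."""
    budgets, remaining = {}, node_budget_w
    for prio in sorted({p for _, p, _ in children}):
        group = [c for c in children if c[1] == prio]
        demand = sum(d for _, _, d in group)
        grant = min(demand, remaining)
        for name, _, d in group:
            budgets[name] = d if demand <= remaining else grant * d / demand
        remaining -= grant
    return budgets

print(assign_budgets(1000, [("latency-critical", 0, 600),
                            ("batch-a", 1, 300), ("batch-b", 1, 300)]))
# {'latency-critical': 600, 'batch-a': 200.0, 'batch-b': 200.0}
```

In a full hierarchy the same routine would be applied recursively, with each child's grant becoming the node budget for the level below it.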
Yatish Turakhia, Sneha D. Goenka, G. Bejerano, W. Dally
Whole genome alignment (WGA) is an indispensable tool in comparative genomics to study how different lifeforms have been shaped by evolution at the molecular level. Existing software whole genome aligners require several CPU weeks to compare a pair of mammalian genomes and still miss several biologically-meaningful, high-scoring alignment regions. These aligners are based on the seed-filter-and-extend paradigm with an ungapped filtering stage. Ungapped filtering is responsible for the low sensitivity of these aligners but is used because it is 200× faster than performing gapped alignment, using dynamic programming, in software. In this paper, we show that both performance and sensitivity can be greatly improved by using a hardware accelerator for WGA. Using the genomes of two roundworms (C. elegans and C. briggsae) and four fruit flies (D. melanogaster, D. simulans, D. yakuba, and D. pseudoobscura), we show that replacing ungapped filtering with gapped filtering increases the number of matching base-pairs in alignments by up to 3×. Our accelerator, Darwin-WGA, is the first hardware accelerator for whole genome alignment and accelerates the gapped filtering stage. Darwin-WGA also employs GACT-X, a novel algorithm used in the extension stage to align arbitrarily long genome sequences using a small on-chip memory, which provides better-quality alignments at a 2× improvement in memory and speed over the previously published GACT algorithm. Implemented on an FPGA, Darwin-WGA provides up to 24× improvement (performance/$) in WGA over iso-sensitive software. An ASIC implementation of the proposed architecture in TSMC 40 nm technology takes around 43 W of power and 36 mm² of area. It achieves up to 10× performance/watt improvement on whole genome alignments over state-of-the-art software at higher sensitivity, and up to 1,500× performance/watt improvement compared to iso-sensitive software. Darwin-WGA is released under the open-source MIT license and is available from https://github.com/gsneha26/Darwin-WGA.
Keywords: Co-processor; Comparative Genomics; Whole Genome Alignment; Gapped Filtering
{"title":"Darwin-WGA: A Co-processor Provides Increased Sensitivity in Whole Genome Alignments with High Speedup","authors":"Yatish Turakhia, Sneha D. Goenka, G. Bejerano, W. Dally","doi":"10.1109/HPCA.2019.00050","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00050","url":null,"abstract":"Whole genome alignment (WGA) is an indispensable tool in comparative genomics to study how different lifeforms have been shaped by evolution at the molecular level. Existing software whole genome aligners require several CPU weeks to compare a pair of mammalian genomes and still miss several biologically-meaningful, high-scoring alignment regions. These aligners are based on the seed-filter-and-extend paradigm with an ungapped filtering stage. Ungapped filtering is responsible for the low sensitivity of these aligners but is used because it is 200× faster than performing gapped alignment, using dynamic programming, in software. In this paper, we show that both performance and sensitivity can be greatly improved by using a hardware accelerator for WGA. Using the genomes of two roundworms (C. elegans and C. Briggsae) and four fruit flies (D. melanogaster, D. simulans, D. yakuba, and D. pseudoobscura), we show that replacing ungapped filtering with gapped filtering increases the number of matching base-pairs in alignments by up to 3×. Our accelerator, Darwin-WGA, is the first hardware accelerator for whole genome alignment and accelerates the gapped filtering stage. Darwin-WGA also employs GACT-X, a novel algorithm used in the extension stage to align arbitrarily long genome sequences using a small on-chip memory, that provides better quality alignments at 2× improvement in memory and speed over the previously published GACT algorithm. Implemented on an FPGA, Darwin-WGA provides up to 24× improvement (performance/$) in WGA over iso-sensitive software. An ASIC implementation of the proposed architecture on TSMC 40nm technology takes around 43W power with 36mm area. It achieves up to 10× performance/watt improvement on whole genome alignments over state-of-the-art software at higher sensitivity, and up to 1,500× performance/watt improvement compared to iso-sensitive software. Darwin-WGA is released under open-source MIT license and is available from https://github.com/gsneha26/Darwin-WGA. Keywords-Co-processor, Comparative Genomics, Whole Genome Alignment, Gapped Filtering","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131376261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiayi Huang, Ramprakash Reddy Puli, Pritam Majumder, Sungkeun Kim, R. Boyapati, K. H. Yum, Eun Jung Kim
The explosion of data availability and the demand for faster data analysis have led to the emergence of applications exhibiting large memory footprints and low data reuse rates. These workloads, ranging from neural networks to graph processing, expose compute kernels that operate over myriads of data. The significant data movement requirements of these kernels impose heavy stress on modern memory subsystems and communication fabrics. To mitigate the worsening gap between high CPU computation density and deficient memory bandwidth, solutions like memory networks and near-data processing designs are being architected to improve system performance substantially. In this work, we examine the idea of mapping compute kernels to the memory network so as to leverage in-network computing in data-flow style, by means of near-data processing. We propose Active-Routing, an in-network compute architecture that enables computation on the way for near-data processing by exploiting patterns of aggregation over intermediate results of arithmetic operators. The proposed architecture leverages the massive memory-level parallelism and network concurrency to optimize the aggregation operations along a dynamically built Active-Routing Tree. Our evaluations show that Active-Routing can achieve up to 7× speedup with an average of 60% performance improvement, and reduce the energy-delay product by 80% across various benchmarks compared to the state-of-the-art processing-in-memory architecture.
{"title":"Active-Routing: Compute on the Way for Near-Data Processing","authors":"Jiayi Huang, Ramprakash Reddy Puli, Pritam Majumder, Sungkeun Kim, R. Boyapati, K. H. Yum, Eun Jung Kim","doi":"10.1109/HPCA.2019.00018","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00018","url":null,"abstract":"—The explosion of data availability and the demand for faster data analysis have led to the emergence of applications exhibiting large memory footprint and low data reuse rate. These workloads, ranging from neural networks to graph processing, expose compute kernels that operate over myriads of data. Significant data movement requirements of these kernels impose heavy stress on modern memory subsystems and communication fabrics. To mitigate the worsening gap between high CPU computation density and deficient memory bandwidth, solutions like memory networks and near-data processing designs are being architected to improve system performance substantially. In this work, we examine the idea of mapping compute ker- nels to the memory network so as to leverage in-network computing in data-flow style, by means of near-data processing. We propose Active-Routing , an in-network compute architecture that enables computation on the way for near-data processing by exploiting patterns of aggregation over intermediate results of arithmetic operators. The proposed architecture leverages the massive memory-level parallelism and network concurrency to optimize the aggregation operations along a dynamically built Active-Routing Tree . Our evaluations show that Active-Routing can achieve upto 7 × speedup with an average of 60% performance improvement, and reduce the energy-delay product by 80% across various benchmarks compared to the state-of-the-art processing-in-memory architecture.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123010735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}