Energy proportionality of data center servers has improved drastically over the past decade, to the point where near-ideal energy-proportional servers are now common. These highly energy proportional servers exhibit the unique property that peak efficiency no longer coincides with peak utilization. In this paper, we explore the implications of this property on data center scheduling. We identify that current state-of-the-art data center schedulers do not efficiently leverage this property, leading to inefficient scheduling decisions. We propose Peak Efficiency Aware Scheduling (PEAS), which can achieve better-than-ideal energy proportionality at the data center level. We demonstrate that PEAS can reduce average power by 25.5% with a 3.0% improvement in TCO compared to state-of-the-art scheduling policies.
{"title":"Peak Efficiency Aware Scheduling for Highly Energy Proportional Servers","authors":"Daniel Wong","doi":"10.1145/3007787.3001188","DOIUrl":"https://doi.org/10.1145/3007787.3001188","url":null,"abstract":"Energy proportionality of data center severs have improved drastically over the past decade to the point where near ideal energy proportional servers are now common. These highly energy proportional servers exhibit the unique property where peak efficiency no longer coincides with peak utilization. In this paper, we explore the implications of this property on data center scheduling. We identified that current state of the art data center schedulers does not efficiently leverage these properties, leading to inefficient scheduling decisions. We propose Peak Efficiency Aware Scheduling (PEAS) which can achieve better-than-ideal energy proportionality at the data center level. We demonstrate that PEAS can reduce average power by 25.5% with 3.0% improvement to TCO compared to state-of-the-art scheduling policies.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"45 1","pages":"481-492"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86318190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Amdahl's law provides architects with a compelling reason to introduce system asymmetry to optimize for both serial and parallel regions of execution. Asymmetry in a multicore processor can arise statically (e.g., from core microarchitecture) or dynamically (e.g., by applying dynamic voltage/frequency scaling). Work stealing is an increasingly popular approach to task distribution that elegantly balances task-based parallelism across multiple worker threads. In this paper, we propose asymmetry-aware work-stealing (AAWS) runtimes, which are carefully designed to exploit both the static and dynamic asymmetry in modern systems. AAWS runtimes use three key hardware/software techniques: work-pacing, work-sprinting, and work-mugging. Work-pacing and work-sprinting are novel techniques that combine a marginal-utility-based approach with integrated voltage regulators to improve performance and energy efficiency in high- and low-parallel regions. Work-mugging is a previously proposed technique that enables a waiting big core to preemptively migrate work from a busy little core. We propose a simple implementation of work-mugging based on lightweight user-level interrupts. We use a vertically integrated research methodology spanning software, architecture, and VLSI to make the case that holistically combining static asymmetry, dynamic asymmetry, and work-stealing runtimes can improve both performance and energy efficiency in future multicore systems.
{"title":"Asymmetry-Aware Work-Stealing Runtimes","authors":"Christopher Torng, Moyang Wang, C. Batten","doi":"10.1145/3007787.3001142","DOIUrl":"https://doi.org/10.1145/3007787.3001142","url":null,"abstract":"Amdahl's law provides architects a compelling reason to introduce system asymmetry to optimize for both serial and parallel regions of execution. Asymmetry in a multicore processor can arise statically (e.g., from core microarchitecture) or dynamically (e.g., applying dynamic voltage/frequency scaling). Work stealing is an increasingly popular approach to task distribution that elegantly balances task-based parallelism across multiple worker threads. In this paper, we propose asymmetry-aware work-stealing (AAWS) runtimes, which are carefully designed to exploit both the static and dynamic asymmetry in modern systems. AAWS runtimes use three key hardware/software techniques: work-pacing, work-sprinting, and work-mugging. Work-pacing and work-sprinting are novel techniques that combine a marginal-utility-based approach with integrated voltage regulators to improve performance and energy efficiency in high-and low-parallel regions. Work-mugging is a previously proposed technique that enables a waiting big core to preemptively migrate work from a busy little core. We propose a simple implementation of work-mugging based on lightweight user-level interrupts. We use a vertically integrated research methodology spanning software, architecture, and VLSI to make the case that holistically combining static asymmetry, dynamic asymmetry, and work-stealing runtimes can improve both performance and energy efficiency in future multicore systems.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"75 1","pages":"40-52"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85972415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shaoli Liu, Zidong Du, Jinhua Tao, D. Han, Tao Luo, Yuan Xie, Yunji Chen, Tianshi Chen
Neural Networks (NN) are a family of models for a broad range of emerging machine learning and pattern recognition applications. NN techniques are conventionally executed on general-purpose processors (such as CPUs and GPGPUs), which are usually not energy-efficient since they invest excessive hardware resources to flexibly support various workloads. Consequently, application-specific hardware accelerators for neural networks have been proposed recently to improve energy efficiency. However, such accelerators were designed for a small set of NN techniques sharing similar computational patterns, and they adopt complex and informative instructions (control signals) directly corresponding to high-level functional blocks of an NN (such as layers), or even an NN as a whole. Although straightforward and easy to implement for a limited set of similar NN techniques, the lack of agility in the instruction set prevents such accelerator designs from supporting a variety of different NN techniques with sufficient flexibility and efficiency. In this paper, we propose a novel domain-specific Instruction Set Architecture (ISA) for NN accelerators, called Cambricon, which is a load-store architecture that integrates scalar, vector, matrix, logical, data transfer, and control instructions, based on a comprehensive analysis of existing NN techniques. Our evaluation over a total of ten representative yet distinct NN techniques demonstrates that Cambricon exhibits strong descriptive capacity over a broad range of NN techniques, and provides higher code density than general-purpose ISAs such as x86, MIPS, and GPGPU. Compared to the latest state-of-the-art NN accelerator design DaDianNao [5] (which can only accommodate 3 types of NN techniques), our Cambricon-based accelerator prototype implemented in TSMC 65nm technology incurs only negligible latency/power/area overheads, with a versatile coverage of 10 different NN benchmarks.
{"title":"Cambricon: An Instruction Set Architecture for Neural Networks","authors":"Shaoli Liu, Zidong Du, Jinhua Tao, D. Han, Tao Luo, Yuan Xie, Yunji Chen, Tianshi Chen","doi":"10.1145/3007787.3001179","DOIUrl":"https://doi.org/10.1145/3007787.3001179","url":null,"abstract":"Neural Networks (NN) are a family of models for a broad range of emerging machine learning and pattern recondition applications. NN techniques are conventionally executed on general-purpose processors (such as CPU and GPGPU), which are usually not energy-efficient since they invest excessive hardware resources to flexibly support various workloads. Consequently, application-specific hardware accelerators for neural networks have been proposed recently to improve the energy-efficiency. However, such accelerators were designed for a small set of NN techniques sharing similar computational patterns, and they adopt complex and informative instructions (control signals) directly corresponding to high-level functional blocks of an NN (such as layers), or even an NN as a whole. Although straightforward and easy-to-implement for a limited set of similar NN techniques, the lack of agility in the instruction set prevents such accelerator designs from supporting a variety of different NN techniques with sufficient flexibility and efficiency. In this paper, we propose a novel domain-specific Instruction Set Architecture (ISA) for NN accelerators, called Cambricon, which is a load-store architecture that integrates scalar, vector, matrix, logical, data transfer, and control instructions, based on a comprehensive analysis of existing NN techniques. Our evaluation over a total of ten representative yet distinct NN techniques have demonstrated that Cambricon exhibits strong descriptive capacity over a broad range of NN techniques, and provides higher code density than general-purpose ISAs such as ×86, MIPS, and GPGPU. Compared to the latest state-of-the-art NN accelerator design DaDianNao [5] (which can only accommodate 3 types of NN techniques), our Cambricon-based accelerator prototype implemented in TSMC 65nm technology incurs only negligible latency/power/area overheads, with a versatile coverage of 10 different NN benchmarks.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"7 1","pages":"393-405"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82686876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Duckhwan Kim, J. Kung, S. Chai, S. Yalamanchili, S. Mukhopadhyay
This paper presents a programmable and scalable digital neuromorphic architecture based on 3D high-density memory integrated with a logic tier for efficient neural computing. The proposed architecture consists of clusters of processing engines (PEs), connected by a 2D mesh network as a processing tier, which is integrated in 3D with multiple tiers of DRAM. The PE clusters access multiple memory channels (vaults) in parallel. The operating principle, referred to as memory-centric computing, embeds specialized state machines within the vault controllers of the HMC to drive data into the PE clusters. The paper presents the basic architecture of the Neurocube and an analysis of the logic tier synthesized in 28nm and 15nm process technologies. The performance of the Neurocube is evaluated and illustrated by mapping a convolutional neural network onto the architecture and estimating the resulting power and performance for both training and inference.
{"title":"Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory","authors":"Duckhwan Kim, J. Kung, S. Chai, S. Yalamanchili, S. Mukhopadhyay","doi":"10.1145/3007787.3001178","DOIUrl":"https://doi.org/10.1145/3007787.3001178","url":null,"abstract":"This paper presents a programmable and scalable digital neuromorphic architecture based on 3D high-density memory integrated with logic tier for efficient neural computing. The proposed architecture consists of clusters of processing engines, connected by 2D mesh network as a processing tier, which is integrated in 3D with multiple tiers of DRAM. The PE clusters access multiple memory channels (vaults) in parallel. The operating principle, referred to as the memory centric computing, embeds specialized state-machines within the vault controllers of HMC to drive data into the PE clusters. The paper presents the basic architecture of the Neurocube and an analysis of the logic tier synthesized in 28nm and 15nm process technologies. The performance of the Neurocube is evaluated and illustrated through the mapping of a Convolutional Neural Network and estimating the subsequent power and performance for both training and inference.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"3 1","pages":"380-392"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83321017","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yunqi Zhang, David Meisner, Jason Mars, Lingjia Tang
Managing tail latency of requests has become one of the primary challenges for large-scale Internet services. Data centers are quickly evolving and service operators frequently desire to make changes to the deployed software and production hardware configurations. Such changes demand a confident understanding of the impact on one's service, in particular the effect on tail latency (e.g., the 95th- or 99th-percentile response latency of the service). Evaluating the impact on the tail is challenging because of its inherent variability. Existing tools and methodologies for measuring these effects suffer from a number of deficiencies, including poor load tester design, statistically inaccurate aggregation, and improper attribution of effects. As shown in the paper, these pitfalls can often result in misleading conclusions. In this paper, we develop a methodology for statistically rigorous performance evaluation and performance factor attribution for server workloads. First, we find that careful design of the server load tester can ensure high-quality performance evaluation, and empirically demonstrate the inaccuracy of load testers in previous work. Learning from the design flaws in prior work, we design and develop a modular load tester platform, Treadmill, that overcomes the pitfalls of existing tools. Next, utilizing Treadmill, we construct measurement and analysis procedures that can properly attribute performance factors. We rely on statistically sound performance evaluation and quantile regression, extending it to accommodate the idiosyncrasies of server systems. Finally, we use our augmented methodology to evaluate the impact of common server hardware features with Facebook production workloads on production hardware. We decompose the effects of these features on request tail latency and demonstrate that our evaluation methodology provides superior results, particularly in capturing complicated and counter-intuitive performance behaviors. By tuning the hardware features as suggested by the attribution, we reduce the 99th-percentile latency by 43% and its variance by 93%.
{"title":"Treadmill: Attributing the Source of Tail Latency through Precise Load Testing and Statistical Inference","authors":"Yunqi Zhang, David Meisner, Jason Mars, Lingjia Tang","doi":"10.1145/3007787.3001186","DOIUrl":"https://doi.org/10.1145/3007787.3001186","url":null,"abstract":"Managing tail latency of requests has become one of the primary challenges for large-scale Internet services. Data centers are quickly evolving and service operators frequently desire to make changes to the deployed software and production hardware configurations. Such changes demand a confident understanding of the impact on one's service, in particular its effect on tail latency (e.g., 95th-or 99th-percentile response latency of the service). Evaluating the impact on the tail is challenging because of its inherent variability. Existing tools and methodologies for measuring these effects suffer from a number of deficiencies including poor load tester design, statistically inaccurate aggregation, and improper attribution of effects. As shown in the paper, these pitfalls can often result in misleading conclusions. In this paper, we develop a methodology for statistically rigorous performance evaluation and performance factor attribution for server workloads. First, we find that careful design of the server load tester can ensure high quality performance evaluation, and empirically demonstrate the inaccuracy of load testers in previous work. Learning from the design flaws in prior work, we design and develop a modular load tester platform, Treadmill, that overcomes pitfalls of existing tools. Next, utilizing Treadmill, we construct measurement and analysis procedures that can properly attribute performance factors. We rely on statistically-sound performance evaluation and quantile regression, extending it to accommodate the idiosyncrasies of server systems. Finally, we use our augmented methodology to evaluate the impact of common server hardware features with Facebook production workloads on production hardware. We decompose the effects of these features on request tail latency and demonstrate that our evaluation methodology provides superior results, particularly in capturing complicated and counter-intuitive performance behaviors. By tuning the hardware features as suggested by the attribution, we reduce the 99th-percentile latency by 43% and its variance by 93%.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"369 1","pages":"456-468"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76423413","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jungrae Kim, Michael B. Sullivan, Sangkug Lym, M. Erez
Increasing transfer rates and decreasing I/O voltage levels make signals more vulnerable to transmission errors. While the data in computer memory are well-protected by modern error checking and correcting (ECC) codes, the clock, control, command, and address (CCCA) signals are weakly protected or even unprotected such that transmission errors leave serious gaps in data-only protection. This paper presents All-Inclusive ECC (AIECC), a memory protection scheme that leverages and augments data ECC to also thoroughly protect CCCA signals. AIECC provides strong end-to-end protection of memory, detecting nearly 100% of CCCA errors and also preventing transmission errors from causing latent memory data corruption. AIECC provides these system-level benefits without requiring extra storage and transfer overheads and without degrading the effective level of data protection.
{"title":"All-Inclusive ECC: Thorough End-to-End Protection for Reliable Computer Memory","authors":"Jungrae Kim, Michael B. Sullivan, Sangkug Lym, M. Erez","doi":"10.1145/3007787.3001203","DOIUrl":"https://doi.org/10.1145/3007787.3001203","url":null,"abstract":"Increasing transfer rates and decreasing I/O voltage levels make signals more vulnerable to transmission errors. While the data in computer memory are well-protected by modern error checking and correcting (ECC) codes, the clock, control, command, and address (CCCA) signals are weakly protected or even unprotected such that transmission errors leave serious gaps in data-only protection. This paper presents All-Inclusive ECC (AIECC), a memory protection scheme that leverages and augments data ECC to also thoroughly protect CCCA signals. AIECC provides strong end-to-end protection of memory, detecting nearly 100% of CCCA errors and also preventing transmission errors from causing latent memory data corruption. AIECC provides these system-level benefits without requiring extra storage and transfer overheads and without degrading the effective level of data protection.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"59 1","pages":"622-633"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78018648","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ikuo Magaki, M. Khazraee, L. V. Gutierrez, M. Taylor
GPU and FPGA-based clouds have already demonstrated the promise of accelerating computing-intensive workloads with greatly improved power and performance. In this paper, we examine the design of ASIC Clouds, which are purpose-built datacenters comprised of large arrays of ASIC accelerators, whose purpose is to optimize the total cost of ownership (TCO) of large, high-volume chronic computations, which are becoming increasingly common as more and more services are built around the Cloud model. On the surface, the creation of ASIC Clouds may seem highly improbable due to high NREs and the inflexibility of ASICs. Surprisingly, however, large-scale ASIC Clouds have already been deployed by a large number of commercial entities, to implement the distributed Bitcoin cryptocurrency system. We begin with a case study of Bitcoin mining ASIC Clouds, which are perhaps the largest ASIC Clouds to date. From there, we design three more ASIC Clouds, including a YouTube-style video transcoding ASIC Cloud, a Litecoin ASIC Cloud, and a Convolutional Neural Network ASIC Cloud, and show 2-3 orders of magnitude better TCO versus CPU and GPU. Among our contributions, we present a methodology that, given an accelerator design, derives Pareto-optimal ASIC Cloud Servers by extracting data from place-and-routed circuits and computational fluid dynamics simulations, and then employing clever but brute-force search to find the best jointly-optimized ASIC, DRAM subsystem, motherboard, power delivery system, cooling system, operating voltage, and case design. Moreover, we show how data center parameters determine which of the many Pareto-optimal points is TCO-optimal. Finally, we examine when it makes sense to build an ASIC Cloud, and examine the impact of ASIC NRE.
{"title":"ASIC Clouds: Specializing the Datacenter","authors":"Ikuo Magaki, M. Khazraee, L. V. Gutierrez, M. Taylor","doi":"10.1145/3007787.3001156","DOIUrl":"https://doi.org/10.1145/3007787.3001156","url":null,"abstract":"GPU and FPGA-based clouds have already demonstrated the promise of accelerating computing-intensive workloads with greatly improved power and performance. In this paper, we examine the design of ASIC Clouds, which are purpose-built datacenters comprised of large arrays of ASIC accelerators, whose purpose is to optimize the total cost of ownership (TCO) of large, high-volume chronic computations, which are becoming increasingly common as more and more services are built around the Cloud model. On the surface, the creation of ASIC clouds may seem highlyimprobable due to high NREs and the inflexibility of ASICs. Surprisingly, however, large-scale ASIC Clouds have already been deployed by a large number of commercial entities, to implement the distributed Bitcoin cryptocurrency system. We begin with a case study of Bitcoin mining ASIC Clouds, which are perhaps the largest ASIC Clouds to date. From there, we design three more ASIC Clouds, including a YouTube-style video transcoding ASIC Cloud, a Litecoin ASIC Cloud, and a Convolutional Neural Network ASIC Cloud and show 2-3 orders of magnitude better TCO versus CPU and GPU. Among our contributions, we present a methodology that given an accelerator design, derives Pareto-optimal ASIC Cloud Servers, by extracting data from place-and-routed circuits and computational fluid dynamic simulations, and then employing clever but brute-force search to find the best jointly-optimized ASIC, DRAM subsystem, motherboard, power delivery system, cooling system, operating voltage, and case design. Moreover, we show how data center parameters determine which of the many Pareto-optimal points is TCO-optimal. Finally we examine when it makes sense to build an ASIC Cloud, and examine the impact of ASIC NRE.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"33 1","pages":"178-190"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81475493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chris Dall, Shih-wei Li, J. Lim, Jason Nieh, G. Koloventzos
ARM servers are becoming increasingly common, making server technologies such as virtualization for ARM of growing importance. We present the first study of ARM virtualization performance on server hardware, including multi-core measurements of two popular ARM and x86 hypervisors, KVM and Xen. We show how ARM hardware support for virtualization can enable much faster transitions between VMs and the hypervisor, a key hypervisor operation. However, current hypervisor designs, including both Type 1 hypervisors such as Xen and Type 2 hypervisors such as KVM, are not able to leverage this performance benefit for real application workloads. We discuss the reasons why and show that other factors related to hypervisor software design and implementation have a larger role in overall performance. Based on our measurements, we discuss changes to ARM's hardware virtualization support that can potentially bridge the gap to bring its faster VM-to-hypervisor transition mechanism to modern Type 2 hypervisors running real applications. These changes have been incorporated into the latest ARM architecture.
{"title":"ARM Virtualization: Performance and Architectural Implications","authors":"Chris Dall, Shih-wei Li, J. Lim, Jason Nieh, G. Koloventzos","doi":"10.1145/3007787.3001169","DOIUrl":"https://doi.org/10.1145/3007787.3001169","url":null,"abstract":"ARM servers are becoming increasingly common, making server technologies such as virtualization for ARM of growing importance. We present the first study of ARM virtualization performance on server hardware, including multi-core measurements of two popular ARM and x86 hypervisors, KVM and Xen. We show how ARM hardware support for virtualization can enable much faster transitions between VMs and the hypervisor, a key hypervisor operation. However, current hypervisor designs, including both Type 1 hypervisors such as Xen and Type 2 hypervisors such as KVM, are not able to leverage this performance benefit for real application workloads. We discuss the reasons why and show that other factors related to hypervisor software design and implementation have a larger role in overall performance. Based on our measurements, we discuss changes to ARM's hardware virtualization support that can potentially bridge the gap to bring its faster VM-to-hypervisor transition mechanism to modern Type 2 hypervisors running real applications. These changes have been incorporated into the latest ARM architecture.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"1 1","pages":"304-316"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88402884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Channoh Kim, Sungmin Kim, Hyeonwoong Cho, Doo-Young Kim, Jaehyeok Kim, Young H. Oh, Hakbeom Jang, Jae W. Lee
Interpreters are widely used to implement high-level language virtual machines (VMs), especially on resource-constrained embedded platforms. Many scripting languages employ interpreter-based VMs for their advantages over native code compilers, such as portability, smaller resource footprint, and compact code. For efficient interpretation, a script (program) is first compiled into an intermediate representation, or bytecodes. The canonical interpreter then runs an infinite loop that fetches, decodes, and executes one bytecode at a time. This bytecode dispatch loop is a well-known source of inefficiency, typically featuring a large jump table with a hard-to-predict indirect jump. Most existing techniques to optimize this loop focus on reducing the misprediction rate of this indirect jump in both hardware and software. However, these techniques are much less effective on embedded processors with shallow pipelines and low IPCs. Instead, we tackle another source of inefficiency more prominent on embedded platforms - redundant computation in the dispatch loop. To this end, we propose Short-Circuit Dispatch (SCD), a low-cost architectural extension that enables fast, hardware-based bytecode dispatch with fewer instructions. The key idea of SCD is to overlay the software-created bytecode jump table on a branch target buffer (BTB). Once a bytecode is fetched, the BTB is looked up using the bytecode, instead of the PC, as the key. If it hits, the interpreter jumps directly to the target address retrieved from the BTB; otherwise, it goes through the original dispatch path. This effectively eliminates redundant computation in the dispatcher code for decode, bounds check, and target address calculation, thus significantly reducing total instruction count. Our simulation results demonstrate that SCD achieves geomean speedups of 19.9% and 14.1% for two production-grade script interpreters for Lua and JavaScript, respectively. Moreover, our fully synthesizable RTL design based on a RISC-V embedded processor shows that SCD improves the EDP of the Lua interpreter by 24.2%, while increasing the chip area by only 0.72% at a 40nm technology node.
{"title":"Short-Circuit Dispatch: Accelerating Virtual Machine Interpreters on Embedded Processors","authors":"Channoh Kim, Sungmin Kim, Hyeonwoong Cho, Doo-Young Kim, Jaehyeok Kim, Young H. Oh, Hakbeom Jang, Jae W. Lee","doi":"10.1145/3007787.3001168","DOIUrl":"https://doi.org/10.1145/3007787.3001168","url":null,"abstract":"Interpreters are widely used to implement high-level language virtual machines (VMs), especially on resource-constrained embedded platforms. Many scripting languages employ interpreter-based VMs for their advantages over native code compilers, such as portability, smaller resource footprint, and compact codes. For efficient interpretation a script (program) is first compiled into an intermediate representation, or bytecodes. The canonical interpreter then runs an infinite loop that fetches, decodes, and executes one bytecode at a time. This bytecode dispatch loop is a well-known source of inefficiency, typically featuring a large jump table with a hard-to-predict indirect jump. Most existing techniques to optimize this loop focus on reducing the misprediction rate of this indirect jump in both hardware and software. However, these techniques are much less effective on embedded processors with shallow pipelines and low IPCs. Instead, we tackle another source of inefficiency more prominent on embedded platforms - redundant computation in the dispatch loop. To this end, we propose Short-Circuit Dispatch (SCD), a low cost architectural extension that enables fast, hardware-based bytecode dispatch with fewer instructions. The key idea of SCD is to overlay the software-created bytecode jump table on a branch target buffer (BTB). Once a bytecode is fetched, the BTB is looked up using the bytecode, instead of PC, as key. If it hits, the interpreter directly jumps to the target address retrieved from the BTB, otherwise, it goes through the original dispatch path. This effectively eliminates redundant computation in the dispatcher code for decode, bound check, and target address calculation, thus significantly reducing total instruction count. Our simulation results demonstrate that SCD achieves geomean speedups of 19.9% and 14.1% for two production-grade script interpreters for Lua and JavaScript, respectively. Moreover, our fully synthesizable RTL design based on a RISC-V embedded processor shows that SCD improves the EDP of the Lua interpreter by 24.2%, while increasing the chip area by only 0.72% at a 40nm technology node.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"28 1","pages":"291-303"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77114916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Donggyu Kim, Adam M. Izraelevitz, Christopher Celio, Hokeun Kim, B. Zimmer, Yunsup Lee, J. Bachrach, K. Asanović
This paper presents a sample-based energy simulation methodology that enables fast and accurate estimations of performance and average power for arbitrary RTL designs. Our approach uses an FPGA to simultaneously simulate the performance of an RTL design and to collect samples containing exact RTL state snapshots. Each snapshot is then replayed in gate-level simulation, resulting in a workload-specific average power estimate with confidence intervals. For arbitrary RTL and workloads, our methodology guarantees a minimum of four-orders-of-magnitude speedup over commercial CAD gate-level simulation tools and gives average energy estimates guaranteed to be within 5% of the true average energy with 99% confidence. We believe our open-source sample-based energy simulation tool Strober can not only rapidly provide ground truth for more abstract power models, but can enable productive design-space exploration early in the RTL design process.
{"title":"Strober: Fast and Accurate Sample-Based Energy Simulation for Arbitrary RTL","authors":"Donggyu Kim, Adam M. Izraelevitz, Christopher Celio, Hokeun Kim, B. Zimmer, Yunsup Lee, J. Bachrach, K. Asanović","doi":"10.1145/3007787.3001151","DOIUrl":"https://doi.org/10.1145/3007787.3001151","url":null,"abstract":"This paper presents a sample-based energy simulation methodology that enables fast and accurate estimations of performance and average power for arbitrary RTL designs. Our approach uses an FPGA to simultaneously simulate the performance of an RTL design and to collect samples containing exact RTL state snapshots. Each snapshot is then replayed in gate-level simulation, resulting in a workload-specific average power estimate with confidence intervals. For arbitrary RTL and workloads, our methodology guarantees a minimum of four-orders-of-magnitude speedup over commercial CAD gate-level simulation tools and gives average energy estimates guaranteed to be within 5% of the true average energy with 99% confidence. We believe our open-source sample-based energy simulation tool Strober can not only rapidly provide ground truth for more abstract power models, but can enable productive design-space exploration early in the RTL design process.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"36 1","pages":"128-139"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81207549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}