
Latest publications: 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)

Peak Efficiency Aware Scheduling for Highly Energy Proportional Servers
Daniel Wong
Energy proportionality of data center servers has improved drastically over the past decade, to the point where near-ideal energy proportional servers are now common. These highly energy proportional servers exhibit the unique property that peak efficiency no longer coincides with peak utilization. In this paper, we explore the implications of this property for data center scheduling. We identify that current state-of-the-art data center schedulers do not efficiently leverage these properties, leading to inefficient scheduling decisions. We propose Peak Efficiency Aware Scheduling (PEAS), which can achieve better-than-ideal energy proportionality at the data center level. We demonstrate that PEAS can reduce average power by 25.5% with a 3.0% improvement in TCO compared to state-of-the-art scheduling policies.
DOI: 10.1145/3007787.3001188 · pp. 481-492 · published 2016-06-18
Cited by: 35
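The scheduling idea in the abstract above can be illustrated with a toy model (a sketch, not the paper's implementation): when idle power is low, a server's work-per-watt peaks below 100% utilization, so a peak-efficiency-aware scheduler spreads load across more servers running near their efficiency peak instead of packing a few servers to full utilization. The power curve and all numbers below are hypothetical.

```python
import math

def power(util):
    # Hypothetical power curve (watts) for a highly energy proportional
    # server: low idle power, superlinear growth near full utilization.
    return 20 + 60 * util + 120 * util ** 3

def efficiency(util):
    # Work done per watt at a given utilization.
    return util / power(util) if util > 0 else 0.0

def peak_efficiency_util():
    # Utilization at which work-per-watt peaks (coarse numeric search).
    return max((u / 100 for u in range(1, 101)), key=efficiency)

def peas_schedule(total_load, n_servers):
    # Wake the fewest servers that can carry the load while each staying
    # at or below its peak-efficiency utilization, then balance across them.
    cap = peak_efficiency_util()
    active = min(n_servers, math.ceil(total_load / cap))
    return active, total_load / active

def packing_schedule(total_load, n_servers):
    # Baseline: fill each server to 100% before waking the next.
    active = min(n_servers, math.ceil(total_load))
    return active, total_load / active

def cluster_power(total_load, n_servers, schedule):
    active, per_server = schedule(total_load, n_servers)
    return active * power(per_server)
```

For two servers' worth of load on an eight-server cluster, the peak-efficiency policy wakes more servers at lower utilization and, for this power curve, draws less total power than packing two servers to 100%.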
Asymmetry-Aware Work-Stealing Runtimes
Christopher Torng, Moyang Wang, C. Batten
Amdahl's law provides architects a compelling reason to introduce system asymmetry to optimize for both serial and parallel regions of execution. Asymmetry in a multicore processor can arise statically (e.g., from core microarchitecture) or dynamically (e.g., applying dynamic voltage/frequency scaling). Work stealing is an increasingly popular approach to task distribution that elegantly balances task-based parallelism across multiple worker threads. In this paper, we propose asymmetry-aware work-stealing (AAWS) runtimes, which are carefully designed to exploit both the static and dynamic asymmetry in modern systems. AAWS runtimes use three key hardware/software techniques: work-pacing, work-sprinting, and work-mugging. Work-pacing and work-sprinting are novel techniques that combine a marginal-utility-based approach with integrated voltage regulators to improve performance and energy efficiency in high- and low-parallel regions. Work-mugging is a previously proposed technique that enables a waiting big core to preemptively migrate work from a busy little core. We propose a simple implementation of work-mugging based on lightweight user-level interrupts. We use a vertically integrated research methodology spanning software, architecture, and VLSI to make the case that holistically combining static asymmetry, dynamic asymmetry, and work-stealing runtimes can improve both performance and energy efficiency in future multicore systems.
DOI: 10.1145/3007787.3001142 · pp. 40-52 · published 2016-06-18
Cited by: 18
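As a refresher on the baseline mechanism AAWS builds upon, here is a toy single-threaded model of a work-stealing runtime (a sketch, not the AAWS implementation): each worker owns a deque, popping its own tasks LIFO from the tail, while an idle worker steals FIFO from the head of a busy peer's deque. AAWS's work-mugging additionally lets a waiting big core preempt a task a little core is already running, which this sketch does not model.

```python
from collections import deque

class Worker:
    """Toy model of one worker's task deque: the owner pushes/pops at the
    tail (LIFO); idle workers steal the oldest task from the head (FIFO)."""
    def __init__(self, wid):
        self.wid = wid
        self.tasks = deque()

    def push(self, task):
        self.tasks.append(task)

    def pop(self):
        return self.tasks.pop() if self.tasks else None

    def steal_from(self, victim):
        return victim.tasks.popleft() if victim.tasks else None

def run(workers):
    # Round-robin the workers; an idle worker steals from the first busy
    # peer. In divide-and-conquer workloads the stolen (oldest) task is
    # typically the largest remaining piece of work.
    log = []
    while any(w.tasks for w in workers):
        for w in workers:
            task = w.pop()
            if task is None:
                victim = next((v for v in workers if v is not w and v.tasks), None)
                task = w.steal_from(victim) if victim else None
            if task is not None:
                log.append((w.wid, task))
    return log
```

Seeding all tasks onto one worker shows the second worker picking up work purely through steals.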
Cambricon: An Instruction Set Architecture for Neural Networks
Shaoli Liu, Zidong Du, Jinhua Tao, D. Han, Tao Luo, Yuan Xie, Yunji Chen, Tianshi Chen
Neural Networks (NN) are a family of models for a broad range of emerging machine learning and pattern recognition applications. NN techniques are conventionally executed on general-purpose processors (such as CPU and GPGPU), which are usually not energy-efficient since they invest excessive hardware resources to flexibly support various workloads. Consequently, application-specific hardware accelerators for neural networks have been proposed recently to improve the energy-efficiency. However, such accelerators were designed for a small set of NN techniques sharing similar computational patterns, and they adopt complex and informative instructions (control signals) directly corresponding to high-level functional blocks of an NN (such as layers), or even an NN as a whole. Although straightforward and easy to implement for a limited set of similar NN techniques, the lack of agility in the instruction set prevents such accelerator designs from supporting a variety of different NN techniques with sufficient flexibility and efficiency. In this paper, we propose a novel domain-specific Instruction Set Architecture (ISA) for NN accelerators, called Cambricon, which is a load-store architecture that integrates scalar, vector, matrix, logical, data transfer, and control instructions, based on a comprehensive analysis of existing NN techniques. Our evaluation over a total of ten representative yet distinct NN techniques has demonstrated that Cambricon exhibits strong descriptive capacity over a broad range of NN techniques, and provides higher code density than general-purpose ISAs such as x86, MIPS, and GPGPU. Compared to the latest state-of-the-art NN accelerator design DaDianNao [5] (which can only accommodate 3 types of NN techniques), our Cambricon-based accelerator prototype implemented in TSMC 65nm technology incurs only negligible latency/power/area overheads, with a versatile coverage of 10 different NN benchmarks.
DOI: 10.1145/3007787.3001179 · pp. 393-405 · published 2016-06-18
Cited by: 271
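To make the load-store, coarse-operation flavor of such an ISA concrete, here is a toy interpreter for a few Cambricon-style instructions. The mnemonics (MMV for matrix-multiply-vector, VAV for vector-add-vector, VGTM for a vector max-merge usable as ReLU) are modeled on ones in the paper, but the operand format, register file, and semantics below are simplified illustrations, not the actual encoding. The point is that one NN layer decomposes into a few coarse matrix/vector instructions rather than one monolithic "layer" instruction.

```python
def mmv(M, v):
    # Matrix-multiply-vector.
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def vav(a, b):
    # Vector-add-vector.
    return [x + y for x, y in zip(a, b)]

def vgtm(v):
    # Element-wise max against zero, i.e. ReLU.
    return [max(x, 0.0) for x in v]

OPS = {"MMV": mmv, "VAV": vav, "VGTM": vgtm}

def execute(program, regs):
    """Run a list of (opcode, dst, *srcs) instructions over a register
    file modeled as a dict of named scalar/vector/matrix registers."""
    for op, dst, *srcs in program:
        regs[dst] = OPS[op](*[regs[s] for s in srcs])
    return regs

# A fully connected layer y = ReLU(W @ x + b) as three instructions:
program = [
    ("MMV", "v1", "m0", "v0"),   # v1 <- W @ x
    ("VAV", "v2", "v1", "vb"),   # v2 <- v1 + b
    ("VGTM", "v3", "v2"),        # v3 <- ReLU(v2)
]
```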
Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory
Duckhwan Kim, J. Kung, S. Chai, S. Yalamanchili, S. Mukhopadhyay
This paper presents a programmable and scalable digital neuromorphic architecture based on 3D high-density memory integrated with a logic tier for efficient neural computing. The proposed architecture consists of clusters of processing engines, connected by a 2D mesh network, as a processing tier that is integrated in 3D with multiple tiers of DRAM. The PE clusters access multiple memory channels (vaults) in parallel. The operating principle, referred to as memory-centric computing, embeds specialized state machines within the vault controllers of the HMC to drive data into the PE clusters. The paper presents the basic architecture of the Neurocube and an analysis of the logic tier synthesized in 28nm and 15nm process technologies. The performance of the Neurocube is evaluated and illustrated through the mapping of a Convolutional Neural Network and estimating the subsequent power and performance for both training and inference.
DOI: 10.1145/3007787.3001178 · pp. 380-392 · published 2016-06-18
Cited by: 355
Treadmill: Attributing the Source of Tail Latency through Precise Load Testing and Statistical Inference
Yunqi Zhang, David Meisner, Jason Mars, Lingjia Tang
Managing tail latency of requests has become one of the primary challenges for large-scale Internet services. Data centers are quickly evolving and service operators frequently desire to make changes to the deployed software and production hardware configurations. Such changes demand a confident understanding of the impact on one's service, in particular the effect on tail latency (e.g., the 95th- or 99th-percentile response latency of the service). Evaluating the impact on the tail is challenging because of its inherent variability. Existing tools and methodologies for measuring these effects suffer from a number of deficiencies, including poor load tester design, statistically inaccurate aggregation, and improper attribution of effects. As shown in the paper, these pitfalls can often result in misleading conclusions. In this paper, we develop a methodology for statistically rigorous performance evaluation and performance factor attribution for server workloads. First, we find that careful design of the server load tester can ensure high-quality performance evaluation, and we empirically demonstrate the inaccuracy of load testers in previous work. Learning from the design flaws in prior work, we design and develop a modular load tester platform, Treadmill, that overcomes the pitfalls of existing tools. Next, utilizing Treadmill, we construct measurement and analysis procedures that can properly attribute performance factors. We rely on statistically sound performance evaluation and quantile regression, extending it to accommodate the idiosyncrasies of server systems. Finally, we use our augmented methodology to evaluate the impact of common server hardware features with Facebook production workloads on production hardware. We decompose the effects of these features on request tail latency and demonstrate that our evaluation methodology provides superior results, particularly in capturing complicated and counter-intuitive performance behaviors. By tuning the hardware features as suggested by the attribution, we reduce the 99th-percentile latency by 43% and its variance by 93%.
DOI: 10.1145/3007787.3001186 · pp. 456-468 · published 2016-06-18
Cited by: 86
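The statistical core of the abstract above, estimating a tail quantile with honest uncertainty rather than a bare point estimate, can be sketched in a few lines. This mimics the kind of statistically sound aggregation the paper argues for; it is not Treadmill's actual implementation, which uses quantile regression.

```python
import random

def percentile(samples, p):
    # Nearest-rank percentile of a finite sample.
    s = sorted(samples)
    k = max(0, min(len(s) - 1, int(round(p / 100.0 * len(s))) - 1))
    return s[k]

def bootstrap_ci(samples, p=99.0, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate and bootstrap confidence interval for the p-th
    percentile: resample with replacement, recompute the percentile,
    and take empirical quantiles of those resampled estimates."""
    rng = random.Random(seed)
    estimates = sorted(
        percentile([rng.choice(samples) for _ in samples], p)
        for _ in range(n_boot)
    )
    lo = estimates[int(alpha / 2 * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return percentile(samples, p), (lo, hi)
```

Reporting the interval alongside the point estimate is what makes a before/after tail-latency comparison trustworthy: if the intervals overlap heavily, the measured "change" may be noise.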
All-Inclusive ECC: Thorough End-to-End Protection for Reliable Computer Memory
Jungrae Kim, Michael B. Sullivan, Sangkug Lym, M. Erez
Increasing transfer rates and decreasing I/O voltage levels make signals more vulnerable to transmission errors. While the data in computer memory are well-protected by modern error checking and correcting (ECC) codes, the clock, control, command, and address (CCCA) signals are weakly protected or even unprotected such that transmission errors leave serious gaps in data-only protection. This paper presents All-Inclusive ECC (AIECC), a memory protection scheme that leverages and augments data ECC to also thoroughly protect CCCA signals. AIECC provides strong end-to-end protection of memory, detecting nearly 100% of CCCA errors and also preventing transmission errors from causing latent memory data corruption. AIECC provides these system-level benefits without requiring extra storage and transfer overheads and without degrading the effective level of data protection.
DOI: 10.1145/3007787.3001203 · pp. 622-633 · published 2016-06-18
Cited by: 21
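The end-to-end idea, covering command/address signals with the same kind of machinery that already protects data, can be illustrated with a textbook single-error-correcting Hamming code over a combined frame. This is only a sketch of the principle: AIECC itself leverages and augments the existing data ECC rather than using the toy code below, and the frame layout here is invented for illustration.

```python
def hamming_encode(bits):
    """Single-error-correcting Hamming code over a bit list; parity bits
    occupy the power-of-two positions (1-indexed), each covering the
    positions whose index has the corresponding bit set."""
    n_parity = 0
    while (1 << n_parity) < len(bits) + n_parity + 1:
        n_parity += 1
    data = iter(bits)
    code = [0 if pos & (pos - 1) == 0 else next(data)
            for pos in range(1, len(bits) + n_parity + 1)]
    for p in range(n_parity):
        mask = 1 << p
        parity = 0
        for pos, b in enumerate(code, 1):
            if pos & mask:
                parity ^= b
        code[mask - 1] = parity
    return code

def hamming_decode(code):
    """Return (data_bits, error_position); error_position 0 means no error.
    A single flipped bit anywhere in the frame is located and corrected."""
    syndrome = 0
    for pos, b in enumerate(code, 1):
        if b:
            syndrome ^= pos
    fixed = list(code)
    if syndrome:
        fixed[syndrome - 1] ^= 1
    data = [b for pos, b in enumerate(fixed, 1) if pos & (pos - 1)]
    return data, syndrome

def make_frame(data, addr, cmd):
    # AIECC-flavored framing: put address and command bits under the same
    # codeword as the data, so a flip on the address lines is caught too.
    return data + addr + cmd
```

With data-only ECC, a corrupted address bit silently writes good data to the wrong location; once the address bits sit inside the codeword, the same decode step detects and corrects the flip.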
ASIC Clouds: Specializing the Datacenter
Ikuo Magaki, M. Khazraee, L. V. Gutierrez, M. Taylor
GPU and FPGA-based clouds have already demonstrated the promise of accelerating computing-intensive workloads with greatly improved power and performance. In this paper, we examine the design of ASIC Clouds, which are purpose-built datacenters comprised of large arrays of ASIC accelerators, whose purpose is to optimize the total cost of ownership (TCO) of large, high-volume chronic computations, which are becoming increasingly common as more and more services are built around the Cloud model. On the surface, the creation of ASIC clouds may seem highly improbable due to high NREs and the inflexibility of ASICs. Surprisingly, however, large-scale ASIC Clouds have already been deployed by a large number of commercial entities, to implement the distributed Bitcoin cryptocurrency system. We begin with a case study of Bitcoin mining ASIC Clouds, which are perhaps the largest ASIC Clouds to date. From there, we design three more ASIC Clouds, including a YouTube-style video transcoding ASIC Cloud, a Litecoin ASIC Cloud, and a Convolutional Neural Network ASIC Cloud, and show 2-3 orders of magnitude better TCO versus CPU and GPU. Among our contributions, we present a methodology that, given an accelerator design, derives Pareto-optimal ASIC Cloud Servers, by extracting data from place-and-routed circuits and computational fluid dynamics simulations, and then employing clever but brute-force search to find the best jointly-optimized ASIC, DRAM subsystem, motherboard, power delivery system, cooling system, operating voltage, and case design. Moreover, we show how data center parameters determine which of the many Pareto-optimal points is TCO-optimal. Finally, we examine when it makes sense to build an ASIC Cloud, and examine the impact of ASIC NRE.
DOI: 10.1145/3007787.3001156 · pp. 178-190 · published 2016-06-18
Cited by: 92
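The central TCO trade-off, amortizing the ASIC's NRE against its per-unit cost and energy advantage, can be sketched with a back-of-the-envelope model. Every number below is hypothetical; the paper derives its Pareto-optimal design points from place-and-routed circuits and thermal simulation, not from a model this crude.

```python
def tco_per_throughput(nre, unit_cost, power_w, ops_per_s, volume,
                       years=2, price_per_kwh=0.10, pue=1.2):
    """Dollars of total cost of ownership per op/s of provisioned
    throughput: amortized NRE + hardware capex + lifetime energy,
    with energy scaled by a datacenter PUE factor."""
    hours = years * 365 * 24
    energy_cost = power_w / 1000.0 * hours * price_per_kwh * pue
    return (nre / volume + unit_cost + energy_cost) / ops_per_s

def asic_tco(volume):
    # Hypothetical ASIC: $5M NRE, cheap and efficient per unit.
    return tco_per_throughput(nre=5e6, unit_cost=50, power_w=75,
                              ops_per_s=1e12, volume=volume)

# Hypothetical off-the-shelf GPU server: no NRE, but costlier and
# less efficient for this one fixed workload.
GPU_TCO = tco_per_throughput(nre=0, unit_cost=5000, power_w=300,
                             ops_per_s=5e11, volume=1)
```

The crossover behavior is the point of the exercise: at high deployment volume the NRE amortizes away and the ASIC wins on TCO, while at low volume the NRE term dominates and the commodity part wins, which is exactly the "when does an ASIC Cloud make sense" question the abstract raises.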
ARM Virtualization: Performance and Architectural Implications
Chris Dall, Shih-wei Li, J. Lim, Jason Nieh, G. Koloventzos
ARM servers are becoming increasingly common, making server technologies such as virtualization for ARM of growing importance. We present the first study of ARM virtualization performance on server hardware, including multi-core measurements of two popular ARM and x86 hypervisors, KVM and Xen. We show how ARM hardware support for virtualization can enable much faster transitions between VMs and the hypervisor, a key hypervisor operation. However, current hypervisor designs, including both Type 1 hypervisors such as Xen and Type 2 hypervisors such as KVM, are not able to leverage this performance benefit for real application workloads. We discuss the reasons why and show that other factors related to hypervisor software design and implementation have a larger role in overall performance. Based on our measurements, we discuss changes to ARM's hardware virtualization support that can potentially bridge the gap to bring its faster VM-to-hypervisor transition mechanism to modern Type 2 hypervisors running real applications. These changes have been incorporated into the latest ARM architecture.
DOI: 10.1145/3007787.3001169 · pp. 304-316 · published 2016-06-18
Citations: 47
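The paper's central finding, that cheaper per-transition costs do not automatically become application-level speedups, follows from a simple time-accounting argument that can be sketched in Python. The function name and all numbers below are illustrative assumptions, not measurements from the paper.

```python
def hypervisor_transition_overhead(exits_per_sec, cycles_per_exit, cpu_hz):
    """Fraction of CPU time a guest spends on VM-to-hypervisor transitions."""
    return exits_per_sec * cycles_per_exit / cpu_hz

# Illustrative numbers only: even a costly 10,000-cycle exit taken
# 10,000 times per second on a 2 GHz core consumes just 5% of CPU time,
# so halving transition latency can recover at most 2.5% of runtime;
# hypervisor software design then dominates overall performance.
overhead = hypervisor_transition_overhead(10_000, 10_000, 2_000_000_000)
print(overhead)  # 0.05
```

This back-of-envelope view is consistent with the paper's argument that software design and implementation factors outweigh raw transition latency for real workloads.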
Strober: Fast and Accurate Sample-Based Energy Simulation for Arbitrary RTL
Donggyu Kim, Adam M. Izraelevitz, Christopher Celio, Hokeun Kim, B. Zimmer, Yunsup Lee, J. Bachrach, K. Asanović
This paper presents a sample-based energy simulation methodology that enables fast and accurate estimations of performance and average power for arbitrary RTL designs. Our approach uses an FPGA to simultaneously simulate the performance of an RTL design and to collect samples containing exact RTL state snapshots. Each snapshot is then replayed in gate-level simulation, resulting in a workload-specific average power estimate with confidence intervals. For arbitrary RTL and workloads, our methodology guarantees a minimum of four-orders-of-magnitude speedup over commercial CAD gate-level simulation tools and gives average energy estimates guaranteed to be within 5% of the true average energy with 99% confidence. We believe our open-source sample-based energy simulation tool Strober can not only rapidly provide ground truth for more abstract power models, but can enable productive design-space exploration early in the RTL design process.
Donggyu Kim, Adam M. Izraelevitz, Christopher Celio, Hokeun Kim, B. Zimmer, Yunsup Lee, J. Bachrach, K. Asanović. "Strober: Fast and Accurate Sample-Based Energy Simulation for Arbitrary RTL." 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), June 18, 2016, pp. 128-139. DOI: 10.1145/3007787.3001151
Citations: 27
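Strober's statistical core, replaying randomly sampled RTL snapshots in gate-level simulation and reporting a mean power with a confidence interval, can be sketched as follows. The per-snapshot values here are synthetic, and the 99% interval uses a plain normal approximation; the paper's exact estimator may differ.

```python
import math
import random

def mean_with_confidence(samples, z=2.576):
    """Sample mean with a normal-approximation confidence interval
    (z = 2.576 corresponds to roughly 99% confidence)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)  # sample variance
    half = z * math.sqrt(var / n)
    return mean, (mean - half, mean + half)

# Synthetic per-snapshot average-power readings (mW), standing in for
# gate-level replays of 100 random RTL state snapshots.
random.seed(42)
power_samples = [random.gauss(120.0, 8.0) for _ in range(100)]
mean, (lo, hi) = mean_with_confidence(power_samples)
```

The confidence half-width shrinks as the square root of the sample count, which is why a modest number of replayed snapshots suffices for a tight workload-specific estimate.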
Exploiting Dynamic Timing Slack for Energy Efficiency in Ultra-Low-Power Embedded Systems
Hari Cherupalli, Rakesh Kumar, J. Sartori
Many emerging applications such as the internet of things, wearables, and sensor networks have ultra-low-power requirements. At the same time, cost and programmability considerations dictate that many of these applications will be powered by general purpose embedded microprocessors and microcontrollers, not ASICs. In this paper, we exploit a new opportunity for improving energy efficiency in ultralow-power processors expected to drive these applications -- dynamic timing slack. Dynamic timing slack exists when an embedded software application executed on a processor does not exercise the processor's static critical paths. In such scenarios, the longest path exercised by the application has additional timing slack which can be exploited for power savings at no performance cost by scaling down the processor's voltage at the same frequency until the longest exercised paths just meet timing constraints. Paths that cannot be exercised by an application can safely be allowed to violate timing constraints. We show that dynamic timing slack exists for many ultra-low-power applications and that exploiting dynamic timing slack can result in significant power savings for any ultra-low-power processors. We also present an automated methodology for identifying dynamic timing slack and selecting a safe operating point for a processor and a particular embedded software. Our approach for identifying and exploiting dynamic timing slack is non-speculative, requires no programmer intervention and little or no hardware support, and demonstrates potential power savings of up to 32%, 25% on average, over a range of embedded applications running on a common ultra-low-power processor, at no performance cost.
Hari Cherupalli, Rakesh Kumar, J. Sartori. "Exploiting Dynamic Timing Slack for Energy Efficiency in Ultra-Low-Power Embedded Systems." 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), June 18, 2016, pp. 671-681. DOI: 10.1145/3007787.3001208
Citations: 36
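A first-order sketch of the idea: only the longest path the application actually exercises must meet timing, so voltage can be scaled down until that path just fills the clock period. The linear delay-versus-voltage model and quadratic power model below are textbook simplifications, not the paper's methodology, and all path names and delays are hypothetical.

```python
def dynamic_timing_slack_savings(path_delays, exercised, clock_period, v_nom=1.0):
    """Estimate the safe voltage and dynamic-power savings when only the
    application-exercised paths must meet timing."""
    # Longest path the application can actually exercise; unexercised
    # static critical paths may safely violate timing.
    longest = max(path_delays[p] for p in exercised)
    assert longest <= clock_period, "exercised paths must meet timing at v_nom"
    # First-order model: delay ~ 1/V, so the minimum safe voltage is the
    # one at which the longest exercised path exactly fills the period.
    v_safe = v_nom * longest / clock_period
    # Dynamic power at a fixed frequency scales roughly as V^2.
    savings = 1.0 - (v_safe / v_nom) ** 2
    return v_safe, savings

# Hypothetical design: the static critical path "mul" (9.8 ns) is never
# exercised by this workload, so timing is set by "alu" (8.0 ns) against
# a 10 ns clock period, leaving 2 ns of dynamic timing slack.
delays = {"alu": 8.0, "mul": 9.8, "ld": 6.5}
v_safe, savings = dynamic_timing_slack_savings(delays, {"alu", "ld"}, clock_period=10.0)
```

Under these toy assumptions the voltage can drop to 0.8 of nominal, cutting dynamic power by roughly a third, in the same spirit as the paper's reported average savings of 25%.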
Journal: 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)