
2019 IEEE International Symposium on High Performance Computer Architecture (HPCA): Latest Publications

Understanding the Future of Energy Efficiency in Multi-Module GPUs
A. Arunkumar, Evgeny Bolotin, D. Nellans, Carole-Jean Wu
As Moore’s law slows down, GPUs must pivot towards multi-module designs to continue scaling performance at historical rates. Prior work on multi-module GPUs has focused on performance, while largely ignoring the issue of energy efficiency. In this work, we propose a new metric for GPU efficiency called EDP Scaling Efficiency that quantifies the effects of both strong performance scaling and overall energy efficiency in these designs. To enable this analysis, we develop a novel top-down GPU energy estimation framework that is accurate within 10% of a recent GPU design. Being decoupled from granular GPU microarchitectural details, the framework is appropriate for energy efficiency studies in future GPUs. Using this model in conjunction with performance simulation, we show that the dominating factor influencing the energy efficiency of GPUs over the next decade is GPU module (GPM) idle time. Furthermore, neither inter-module interconnect energy nor GPM microarchitectural design is expected to play a key role in this regard. We demonstrate that multi-module GPUs are on a trajectory to become 2× less energy efficient than current monolithic designs; a significant issue for data centers, which are already energy constrained. Finally, we show that architects must be willing to spend more (not less) energy to enable higher-bandwidth inter-GPM connections, because, counter-intuitively, this additional energy expenditure can reduce total GPU energy consumption by as much as 45%, providing a path to energy-efficient strong scaling in the future.
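The EDP-based metric lends itself to a compact back-of-the-envelope calculation. Below is a minimal sketch (our illustration, not the authors' framework) that computes an EDP scaling efficiency for a hypothetical N-module GPU: it assumes EDP = energy × delay, defines efficiency as the ideal EDP under perfect strong scaling divided by the achieved EDP, and includes an idle-power term to capture the GPM idle-time effect the abstract highlights. All names and numbers are hypothetical.

```python
# Hedged sketch: EDP scaling efficiency for a hypothetical multi-module GPU.
# Assumptions (ours, not the paper's): EDP = energy * delay; ideal strong
# scaling divides runtime by the module count with all modules fully busy.

def edp_scaling_efficiency(t1, n_gpm, speedup, p_active, p_idle):
    """Ratio of ideal to achieved energy-delay product (1.0 = perfect)."""
    # Ideal case: n_gpm modules, runtime t1/n_gpm, no idle time at all.
    t_ideal = t1 / n_gpm
    e_ideal = n_gpm * p_active * t_ideal          # equals p_active * t1
    edp_ideal = e_ideal * t_ideal

    # Achieved case: imperfect speedup leaves modules idle part of the time.
    t_real = t1 / speedup
    busy = speedup / n_gpm                         # average module utilization
    e_real = n_gpm * (busy * p_active + (1 - busy) * p_idle) * t_real
    edp_real = e_real * t_real
    return edp_ideal / edp_real

# Example: 4 GPMs, only 3x speedup, idle modules still draw 40% of active power.
print(edp_scaling_efficiency(t1=1.0, n_gpm=4, speedup=3.0,
                             p_active=100.0, p_idle=40.0))   # ~0.66
```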
{"title":"Understanding the Future of Energy Efficiency in Multi-Module GPUs","authors":"A. Arunkumar, Evgeny Bolotin, D. Nellans, Carole-Jean Wu","doi":"10.1109/HPCA.2019.00063","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00063","url":null,"abstract":"As Moore’s law slows down, GPUs must pivot towards multi-module designs to continue scaling performance at historical rates. Prior work on multi-module GPUs has focused on performance, while largely ignoring the issue of energy efficiency. In this work, we propose a new metric for GPU efficiency called EDP Scaling Efficiency that quantifies the effects of both strong performance scaling and overall energy efficiency in these designs. To enable this analysis, we develop a novel top-down GPU energy estimation framework that is accurate within 10% of a recent GPU design. Being decoupled from granular GPU microarchitectural details, the framework is appropriate for energy efficiency studies in future GPUs. Using this model in conjunction with performance simulation, we show that the dominating factor influencing the energy efficiency of GPUs over the next decade is GPUmodule (GPM) idle time. Furthermore, neither inter-module interconnect energy, nor GPM microarchitectural design is expected to play a key role in this regard. We demonstrate that multi-module GPUs are on a trajectory to become 2⇥ less energy efficient than current monolithic designs; a significant issue for data centers which are already energy constrained. Finally, we show that architects must be willing to spend more (not less) energy to enable higher bandwidth inter-GPM connections, because counter-intuitively, this additional energy expenditure can reduce total GPU energy consumption by as much as 45%, providing a path to energy efficient strong scaling in the future.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128564650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 24
The Accelerator Wall: Limits of Chip Specialization
Adi Fuchs, D. Wentzlaff
Specializing chips using hardware accelerators has become the prime means to alleviate the gap between growing computational demands and the stagnating transistor budgets caused by the slowdown of CMOS scaling. Much of the benefit of chip specialization stems from optimizing a computational problem within a given chip’s transistor budget. Unfortunately, the stagnation of the number of transistors available on a chip will limit the accelerator design optimization space, leading to diminishing specialization returns, ultimately hitting an accelerator wall. In this work, we tackle the question: what are the limits of future accelerators and chip specialization? We do this by characterizing how current accelerators depend on CMOS scaling, based on a physical modeling tool that we constructed using datasheets of thousands of chips. We identify key concepts used in chip specialization, and explore case studies to understand how specialization has progressed over time in different applications and chip platforms (e.g., GPUs, FPGAs, ASICs). Utilizing these insights, we build a model which projects forward to see what future gains can and cannot be enabled by chip specialization. A quantitative analysis of specialization returns and technological boundaries is critical to help researchers understand the limits of accelerators and develop methods to surmount them. Keywords: Accelerator Wall; Moore’s Law; CMOS Scaling
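The distinction between technology-driven and specialization-driven gains can be made concrete with a toy calculation. The sketch below (ours, not the paper's physical model) factors an accelerator's measured speedup over an older baseline chip into a CMOS-scaling component and a residual specialization component, under the simplifying assumption that technology-driven throughput scales with transistor count times clock frequency.

```python
# Hedged sketch: split an accelerator's gain over a baseline chip into a
# CMOS-scaling part and a chip-specialization part. The proportionality
# assumption (raw throughput ~ transistors * frequency) is ours alone.

def specialization_return(speedup, tx_new, tx_old, f_new, f_old):
    cmos_gain = (tx_new / tx_old) * (f_new / f_old)   # technology-driven
    return speedup / cmos_gain                         # residual from design

# Example: 20x measured speedup, 4x more transistors, 1.25x higher clock,
# so specialization itself contributes the remaining 4x.
print(specialization_return(speedup=20.0, tx_new=4e9, tx_old=1e9,
                            f_new=1.5e9, f_old=1.2e9))  # -> 4.0
```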
{"title":"The Accelerator Wall: Limits of Chip Specialization","authors":"Adi Fuchs, D. Wentzlaff","doi":"10.1109/HPCA.2019.00023","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00023","url":null,"abstract":"Specializing chips using hardware accelerators has become the prime means to alleviate the gap between the growing computational demands and the stagnating transistor budgets caused by the slowdown of CMOS scaling. Much of the benefits of chip specialization stems from optimizing a computational problem within a given chip’s transistor budget. Unfortunately, the stagnation of the number of transistors available on a chip will limit the accelerator design optimization space, leading to diminishing specialization returns, ultimately hitting an accelerator wall. In this work, we tackle the question of what are the limits of future accelerators and chip specialization? We do this by characterizing how current accelerators depend on CMOS scaling, based on a physical modeling tool that we constructed using datasheets of thousands of chips. We identify key concepts used in chip specialization, and explore case studies to understand how specialization has progressed over time in different applications and chip platforms (e.g., GPUs, FPGAs, ASICs)1. Utilizing these insights, we build a model which projects forward to see what future gains can and cannot be enabled from chip specialization. A quantitative analysis of specialization returns and technological boundaries is critical to help researchers understand the limits of accelerators and develop methods to surmount them. Keywords-Accelerator Wall; Moore’s Law; CMOS Scaling","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134624413","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 40
POWERT Channels: A Novel Class of Covert Communication Exploiting Power Management Vulnerabilities
S. K. Khatamifard, Longfei Wang, A. Das, Selçuk Köse, Ulya R. Karpuzcu
To be able to meet demanding application performance requirements within a tight power budget, runtime power management must track hardware activity at a very fine granularity in both space and time. This gives rise to sophisticated power management algorithms, which need the underlying system to be both highly observable (able to sense changes in instantaneous power demand in a timely fashion) and controllable (able to react to those changes in a timely fashion). The end goal is allocating the power budget, which itself represents a very critical shared resource, in a fair way among active tasks of execution. Fundamentally, if not carefully managed, any system-wide shared resource can give rise to covert communication. The power budget is no exception, particularly as systems become more and more observable and controllable. In this paper, we demonstrate how power management vulnerabilities can enable covert communication over a previously unexplored, novel class of covert channels which we refer to as POWERT channels. We also provide a comprehensive characterization of POWERT channel capacity under various sharing and activity scenarios. Our analysis, based on experiments on representative commercial systems, reveals a peak channel capacity of 121.6 bits per second (bps). Keywords: covert channels; power management; power headroom modulation
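The core mechanism is easy to sketch: because both parties compete for one managed power budget, a sender can modulate its power demand and a receiver can decode bits from its own observed throughput. The toy sender/receiver below (meant to run as two separate processes on a power-managed system) illustrates the idea only; the bit period, busy-work loop, and threshold are hypothetical, and the paper's actual protocol and characterization are far more refined.

```python
# Hedged sketch of the general idea behind a power-budget covert channel:
# the sender modulates power demand; the receiver infers bits from its own
# throughput, since both contend for one managed power budget. All constants
# are illustrative, not taken from the paper.
import time

BIT_PERIOD = 0.05  # seconds per bit (hypothetical)

def send(bits):
    for b in bits:
        end = time.perf_counter() + BIT_PERIOD
        if b:                      # '1': burn power with busy work
            while time.perf_counter() < end:
                _ = sum(i * i for i in range(10_000))
        else:                      # '0': stay idle, freeing power headroom
            time.sleep(BIT_PERIOD)

def receive(n_bits, threshold):
    bits = []
    for _ in range(n_bits):
        end = time.perf_counter() + BIT_PERIOD
        iters = 0                  # our throughput drops while sender burns power
        while time.perf_counter() < end:
            _ = sum(i * i for i in range(10_000))
            iters += 1
        bits.append(1 if iters < threshold else 0)
    return bits
```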
{"title":"POWERT Channels: A Novel Class of Covert CommunicationExploiting Power Management Vulnerabilities","authors":"S. K. Khatamifard, Longfei Wang, A. Das, Selçuk Köse, Ulya R. Karpuzcu","doi":"10.1109/HPCA.2019.00045","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00045","url":null,"abstract":"To be able to meet demanding application performance requirements within a tight power budget, runtime power management must track hardware activity at a very fine granularity in both space and time. This gives rise to sophisticated power management algorithms, which need the underlying system to be both highly observable (to be able to sense changes in instantaneous power demand timely) and controllable (to be able to react to changes in instantaneous power demand timely). The end goal is allocating the power budget, which itself represents a very critical shared resource, in a fair way among active tasks of execution. Fundamentally, if not carefully managed, any system-wide shared resource can give rise to covert communication. Power budget does not represent an exception, particularly as systems are becoming more and more observable and controllable. In this paper, we demonstrate how power management vulnerabilities can enable covert communication over a previously unexplored, novel class of covert channels which we will refer to as POWERT channels. We also provide a comprehensive characterization of the POWERT channel capacity under various sharing and activity scenarios. Our analysis based on experiments on representative commercial systems reveal a peak channel capacity of 121.6 bits per second (bps). Keywords-covert channels; power management; power headroom modulation.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131377553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 36
Machine Learning at Facebook: Understanding Inference at the Edge
Carole-Jean Wu, D. Brooks, Kevin Chen, Douglas Chen, Sy Choudhury, Marat Dukhan, K. Hazelwood, Eldad Isaac, Yangqing Jia, Bill Jia, Tommer Leyvand, Hao Lu, Yang Lu, Lin Qiao, Brandon Reagen, Joe Spisak, Fei Sun, Andrew Tulloch, Péter Vajda, Xiaodong Wang, Yanghan Wang, Bram Wasti, Yiming Wu, Ran Xian, S. Yoo, Peizhao Zhang
At Facebook, machine learning provides a wide range of capabilities that drive many aspects of user experience, including ranking posts, content understanding, object detection and tracking for augmented and virtual reality, and speech and text translation. While machine learning models are currently trained on customized datacenter infrastructure, Facebook is working to bring machine learning inference to the edge. By doing so, user experience is improved through reduced latency (inference time) and becomes less dependent on network connectivity. Furthermore, this also enables many more applications of deep learning with important features only made available at the edge. This paper takes a data-driven approach to present the opportunities and design challenges faced by Facebook in order to enable machine learning inference locally on smartphones and other edge platforms.
{"title":"Machine Learning at Facebook: Understanding Inference at the Edge","authors":"Carole-Jean Wu, D. Brooks, Kevin Chen, Douglas Chen, Sy Choudhury, Marat Dukhan, K. Hazelwood, Eldad Isaac, Yangqing Jia, Bill Jia, Tommer Leyvand, Hao Lu, Yang Lu, Lin Qiao, Brandon Reagen, Joe Spisak, Fei Sun, Andrew Tulloch, Péter Vajda, Xiaodong Wang, Yanghan Wang, Bram Wasti, Yiming Wu, Ran Xian, S. Yoo, Peizhao Zhang","doi":"10.1109/HPCA.2019.00048","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00048","url":null,"abstract":"At Facebook, machine learning provides a wide range of capabilities that drive many aspects of user experience including ranking posts, content understanding, object detection and tracking for augmented and virtual reality, speech and text translations. While machine learning models are currently trained on customized datacenter infrastructure, Facebook is working to bring machine learning inference to the edge. By doing so, user experience is improved with reduced latency (inference time) and becomes less dependent on network connectivity. Furthermore, this also enables many more applications of deep learning with important features only made available at the edge. This paper takes a datadriven approach to present the opportunities and design challenges faced by Facebook in order to enable machine learning inference locally on smartphones and other edge platforms.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124106493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 357
Publisher's Information
{"title":"Publisher's Information","authors":"","doi":"10.1109/hpca.2019.00070","DOIUrl":"https://doi.org/10.1109/hpca.2019.00070","url":null,"abstract":"","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"17 9","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120925629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
FPGA Accelerated INDEL Realignment in the Cloud
Lisa Wu, David Bruns-Smith, Frank A. Nothaft, Qijing Huang, S. Karandikar, Johnny Le, Andrew Lin, Howard Mao, B. Sweeney, K. Asanović, D. Patterson, A. Joseph
The amount of data being generated in genomics is predicted to be between 2 and 40 exabytes per year for the next decade, making genomic analysis the new frontier and the new challenge for precision medicine. This paper explores targeted deployment of hardware accelerators in the cloud to improve the runtime and throughput of immense-scale genomic data analyses. In particular, INDEL (INsertion/DELetion) realignment is a critical operation that enables diagnostic testing of cancer through error correction prior to variant calling. It is the slowest part of the somatic (cancer) genomic analysis pipeline, the alignment refinement pipeline, and represents roughly one-third of the execution time of time-sensitive diagnostics for acute cancer patients. To accelerate genomic analysis, this paper describes a hardware accelerator for INDEL realignment (IR), and a hardware-software framework leveraging FPGAs-as-a-service in the cloud. We chose to implement genomics analytics on FPGAs because genomic algorithms are still rapidly evolving (e.g., the de facto standard “GATK Best Practices” has had five releases since January of this year). We chose to deploy genomics accelerators in the cloud to reduce capital expenditure and to provide a more quantitative performance and cost analysis. We built and deployed a sea of IR accelerators using our hardware-software accelerator development framework on AWS EC2 F1 instances. We show that our IR accelerator system performed 81× better than multi-threaded genomic analysis software while being 32× more cost-efficient. Keywords: Computer Architecture, Microarchitecture, Accelerator Architecture, Hardware Specialization, Genomic Analytics, INDEL Realignment, FPGA Acceleration, FPGAs-as-a-service, Cloud FPGAs
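As background on the accelerated operation itself: INDEL realignment chooses, among candidate haplotypes, the consensus that best explains a pile of reads, then re-places each read against it. The toy below is our heavily simplified illustration of that consensus-selection step; production tools (e.g., GATK) weight mismatches by base quality and handle full alignment records, none of which is modeled here.

```python
# Hedged toy of the consensus-selection step in INDEL realignment: pick the
# candidate haplotype that minimizes total read mismatches, then place each
# read at its best offset. Real pipelines weight mismatches by base quality;
# this illustrative version just counts them.

def mismatches(read, hap, off):
    return sum(r != h for r, h in zip(read, hap[off:off + len(read)]))

def best_placement(read, hap):
    offs = range(len(hap) - len(read) + 1)
    return min((mismatches(read, hap, o), o) for o in offs)

def realign(reads, haplotypes):
    def total(hap):
        return sum(best_placement(r, hap)[0] for r in reads)
    best = min(haplotypes, key=total)
    return best, [best_placement(r, best)[1] for r in reads]

reads = ["ACGTTG", "GTTGCA"]
haps = ["TTACGTTGCAAT", "TTACGTGCAAT"]   # with / without an inserted T
print(realign(reads, haps))              # picks the first haplotype
```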
{"title":"FPGA Accelerated INDEL Realignment in the Cloud","authors":"Lisa Wu, David Bruns-Smith, Frank A. Nothaft, Qijing Huang, S. Karandikar, Johnny Le, Andrew Lin, Howard Mao, B. Sweeney, K. Asanović, D. Patterson, A. Joseph","doi":"10.1109/HPCA.2019.00044","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00044","url":null,"abstract":"The amount of data being generated in genomics is predicted to be between 2 and 40 exabytes per year for the next decade, making genomic analysis the new frontier and the new challenge for precision medicine. This paper explores targeted deployment of hardware accelerators in the cloud to improve the runtime and throughput of immensescale genomic data analyses. In particular, INDEL (INsertion/DELetion) realignment is a critical operation that enables diagnostic testings of cancer through error correction prior to variant calling. It is the slowest part of the somatic (cancer) genomic analysis pipeline, the alignment refinement pipeline, and represents roughly one-third of the execution time of timesensitive diagnostics for acute cancer patients. To accelerate genomic analysis, this paper describes a hardware accelerator for INDEL realignment (IR), and a hardware-software framework leveraging FPGAs-as-a-service in the cloud. We chose to implement genomics analytics on FPGAs because genomic algorithms are still rapidly evolving (e.g. the de facto standard “GATK Best Practices” has had five releases since January of this year). We chose to deploy genomics accelerators in the cloud to reduce capital expenditure and to provide a more quantitative performance and cost analysis. We built and deployed a sea of IR accelerators using our hardware-software accelerator development framework on AWS EC2 F1 instances. We show that our IR accelerator system performed 81× better than multi-threaded genomic analysis software while being 32× more cost efficient. Keywords-Computer Architecture, Microarchitecture, Accelerator Architecture, Hardware Specialization, Genomic Analytics, INDEL Realignment, FPGA Acceleration, FPGAs-as-aservice, Cloud FPGAs","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133813927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 33
Efficient Load Value Prediction Using Multiple Predictors and Filters
Rami Sheikh, Derek Hower
{"title":"Efficient Load Value Prediction Using Multiple Predictors and Filters","authors":"Rami Sheikh, Derek Hower","doi":"10.1109/HPCA.2019.00057","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00057","url":null,"abstract":"","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115673092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
Composite-ISA Cores: Enabling Multi-ISA Heterogeneity Using a Single ISA
A. Venkat, H. Basavaraj, D. Tullsen
Heterogeneous multicore architectures are composed of multiple cores of different sizes, organizations, and capabilities. These architectures maximize both performance and energy efficiency by allowing applications to adapt to phase changes by migrating execution to the most efficient core. Heterogeneous-ISA architectures further take advantage of the inherent ISA preferences of different application phases to provide additional performance and efficiency gains. This work proposes composite-ISA cores that implement composite feature sets made available from a single large superset ISA. This architecture has the potential to recreate, and in many cases supersede, the gains of multi-ISA heterogeneity by leveraging a single composite ISA, exploiting greater flexibility in ISA choice. Composite-ISA CMPs enhance existing performance gains due to hardware heterogeneity by an average of 19%, and have the potential to achieve an additional 31% energy savings and a 35% reduction in Energy-Delay Product, with no loss in performance.
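To illustrate the scheduling flexibility a superset ISA buys, here is a toy sketch (ours; the feature names, energy numbers, and policy are illustrative only) in which each core implements a composite feature subset of one superset ISA, and each application phase is placed on the cheapest core that offers the features the phase needs.

```python
# Hedged sketch: placing phases onto composite-ISA cores. Each core implements
# a subset of one superset ISA; a phase may run on any core offering the
# features it needs, and we pick the most energy-efficient such core.
# Feature names and energy figures are illustrative only.

CORES = {
    "big":    {"features": {"x87", "sse", "avx", "64bit"}, "energy": 1.00},
    "medium": {"features": {"sse", "64bit"},               "energy": 0.55},
    "little": {"features": {"64bit"},                      "energy": 0.30},
}

def place(phase_needs):
    ok = [(core["energy"], name) for name, core in CORES.items()
          if phase_needs <= core["features"]]       # feature-subset check
    if not ok:
        raise ValueError("no core implements the needed features")
    return min(ok)[1]              # cheapest core that satisfies the phase

print(place({"64bit"}))            # -> little
print(place({"avx", "64bit"}))     # -> big
```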
{"title":"Composite-ISA Cores: Enabling Multi-ISA Heterogeneity Using a Single ISA","authors":"A. Venkat, H. Basavaraj, D. Tullsen","doi":"10.1109/HPCA.2019.00026","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00026","url":null,"abstract":"—Heterogeneous multicore architectures are com- prised of multiple cores of different sizes, organizations, and capabilities. These architectures maximize both performance and energy efficiency by allowing applications to adapt to phase changes by migrating execution to the most efficient core. Heterogeneous-ISA architectures further take advantage of the inherent ISA preferences of different application phases to provide additional performance and efficiency gains. This work proposes composite-ISA cores that implement composite feature sets made available from a single large superset ISA. This architecture has the potential to recreate, and in many cases supersede, the gains of multi-ISA heterogeneity, by leveraging a single composite-ISA, exploiting greater flexibility in ISA choice. Composite-ISA CMPs enhance existing performance gains due to hardware heterogeneity by an average of 19%, and have the potential to achieve an additional 31% energy savings and 35% reduction in Energy Delay Product, with no loss in performance.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"253 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114875509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 19
Stretch: Balancing QoS and Throughput for Colocated Server Workloads on SMT Cores
Artemiy Margaritov, Siddharth Gupta, Rekai González-Alberquilla, Boris Grot
In a drive to maximize resource utilization, today’s datacenters are moving to colocation of latency-sensitive and batch workloads on the same server. State-of-the-art deployments, such as those at Google, colocate such diverse workloads even on a single SMT core. This form of aggressive colocation is afforded by virtue of the fact that a latency-sensitive service operating below its peak load has significant slack in its response latency with respect to the QoS target. The slack affords a degradation in single-thread performance, which is inevitable under SMT colocation, without compromising QoS targets. This work makes the observation that many batch applications can greatly benefit from a large instruction window to uncover ILP and MLP. Under SMT colocation, conventional wisdom holds that individual hardware threads should be limited in their ability to acquire and hold a disproportionately large share of microarchitectural resources, so as not to compromise the performance of a co-running thread. We show that the performance slack inherent in latency-sensitive workloads operating at low to moderate load makes it safe to shift microarchitectural resources to a co-running batch thread without compromising QoS targets. Based on this insight, we introduce Stretch, a simple ROB partitioning scheme that is invoked by system software to provide one hardware thread with a much larger ROB partition at the expense of another thread. When Stretch is enabled for latency-sensitive workloads operating below their peak load on an SMT core, co-running batch applications gain 13% performance on average (30% max) over a baseline SMT colocation, without compromising QoS constraints.
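The policy idea can be sketched in a few lines: while the latency-sensitive thread's observed tail latency sits safely under its QoS target, system software grants the batch thread a stretched ROB partition; once slack shrinks, it reverts to an even split. The sketch below is our illustration, not the paper's mechanism; the ROB size, stretched share, and margin are hypothetical.

```python
# Hedged sketch of a Stretch-like policy: grant the batch thread a larger ROB
# partition only while the latency-sensitive thread has ample QoS slack.
# All sizes and thresholds are illustrative, not from the paper.

ROB_ENTRIES = 224            # hypothetical core
EVEN_SPLIT = ROB_ENTRIES // 2
STRETCHED_BATCH_SHARE = 192  # batch thread gets most of the ROB when safe

def rob_partition(observed_p99_ms, qos_target_ms, margin=0.8):
    """Return (latency_thread_entries, batch_thread_entries)."""
    if observed_p99_ms < margin * qos_target_ms:   # ample latency slack
        return ROB_ENTRIES - STRETCHED_BATCH_SHARE, STRETCHED_BATCH_SHARE
    return EVEN_SPLIT, EVEN_SPLIT                  # protect QoS at high load

print(rob_partition(observed_p99_ms=2.0, qos_target_ms=5.0))  # -> (32, 192)
print(rob_partition(observed_p99_ms=4.8, qos_target_ms=5.0))  # -> (112, 112)
```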
{"title":"Stretch: Balancing QoS and Throughput for Colocated Server Workloads on SMT Cores","authors":"Artemiy Margaritov, Siddharth Gupta, Rekai González-Alberquilla, Boris Grot","doi":"10.1109/HPCA.2019.00024","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00024","url":null,"abstract":"—In a drive to maximize resource utilization, today’s datacenters are moving to colocation of latency-sensitive and batch workloads on the same server. State-of-the-art deployments, such as those at Google, colocate such diverse workloads even on a single SMT core. This form of aggressive colocation is afforded by virtue of the fact that a latency-sensitive service operating below its peak load has significant slack in its response latency with respect to the QoS target. The slack affords a degradation in single-thread performance, which is inevitable under SMT colocation, without compromising QoS targets.This work makes the observation that many batch applications can greatly benefit from a large instruction window to uncover ILP and MLP. Under SMT colocation, conventional wisdom holds that individual hardware threads should be limited in their ability to acquire and hold a disproportionately large share of microarchitectural resources so as not to compromise the performance of a co-running thread. We show that the performance slack inherent in latency-sensitive workloads operating at low to moderate load makes it safe to shift microarchitectural resources to a co-running batch thread without compromising QoS targets. Based on this insight, we introduce Stretch, a simple ROB partitioning scheme that is invoked by system software to provide one hardware thread with a much larger ROB partition at the expense of another thread. When Stretch is enabled for latency-sensitive workloads operating below their peak load on an SMT core, co-running batch applications gain 13% of performance on average (30% max) over a baseline SMT colocation and without compromising QoS constraints.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124023401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 22
Resilient Low Voltage Accelerators for High Energy Efficiency
Nandhini Chandramoorthy, Karthik Swaminathan, M. Cochet, A. Paidimarri, Schuyler Eldridge, R. Joshi, M. Ziegler, A. Buyuktosunoglu, P. Bose
{"title":"Resilient Low Voltage Accelerators for High Energy Efficiency","authors":"Nandhini Chandramoorthy, Karthik Swaminathan, M. Cochet, A. Paidimarri, Schuyler Eldridge, R. Joshi, M. Ziegler, A. Buyuktosunoglu, P. Bose","doi":"10.1109/HPCA.2019.00034","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00034","url":null,"abstract":"","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"156 7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125913461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 41