"FUSE: Fusing STT-MRAM into GPUs to Alleviate Off-Chip Memory Access Overheads," Jie Zhang, Myoungsoo Jung, M. Kandemir. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). DOI: 10.1109/HPCA.2019.00055.
In this work, we propose FUSE, a novel GPU cache system that integrates spin-transfer torque magnetic random-access memory (STT-MRAM) into the on-chip L1D cache. FUSE minimizes the number of outgoing memory accesses over the interconnection network of the GPU's multiprocessors, which in turn considerably improves the level of massive computing parallelism in GPUs. Specifically, FUSE predicts the read level of GPU memory accesses by extracting GPU runtime information and places write-once-read-multiple (WORM) data blocks into the STT-MRAM, while accommodating write-multiple data blocks in a small SRAM portion of the L1D cache. To further reduce off-chip memory accesses, FUSE also allows WORM data blocks to be allocated anywhere in the STT-MRAM by approximating full associativity with a limited number of tag comparators and I/O peripherals. Our evaluation results show that, compared to a traditional GPU cache, our proposed heterogeneous cache reduces the number of outgoing memory references across the interconnection network by 32%, thereby improving overall performance by 217% and reducing energy cost by 53%.
{"title":"FUSE: Fusing STT-MRAM into GPUs to Alleviate Off-Chip Memory Access Overheads","authors":"Jie Zhang, Myoungsoo Jung, M. Kandemir","doi":"10.1109/HPCA.2019.00055","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00055","url":null,"abstract":"In this work, we propose FUSE, a novel GPU cache system that integrates spin-transfer torque magnetic random-access memory (STT-MRAM) into the on-chip L1D cache. FUSE can minimize the number of outgoing memory accesses over the interconnection network of GPU's multiprocessors, which in turn can considerably improve the level of massive computing parallelism in GPUs. Specifically, FUSE predicts a read-level of GPU memory accesses by extracting GPU runtime information and places write-once-read-multiple (WORM) data blocks into the STT-MRAM, while accommodating write-multiple data blocks over a small portion of SRAM in the L1D cache. To further reduce the off-chip memory accesses, FUSE also allows WORM data blocks to be allocated anywhere in the STT-MRAM by approximating the associativity with the limited number of tag comparators and I/O peripherals. Our evaluation results show that, in comparison to a traditional GPU cache, our proposed heterogeneous cache reduces the number of outgoing memory references by 32% across the interconnection network, thereby improving the overall performance by 217% and reducing energy cost by 53%.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123541233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enabling Transparent Memory-Compression for Commodity Memory Systems","authors":"Vinson Young, S. Kariyappa, Moinuddin K. Qureshi","doi":"10.1109/HPCA.2019.00010","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00010","url":null,"abstract":"","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123630728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Freeway: Maximizing MLP for Slice-Out-of-Order Execution," Rakesh Kumar, M. Alipour, D. Black-Schaffer. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). DOI: 10.1109/HPCA.2019.00009.
Exploiting memory-level parallelism (MLP) is crucial to hide long memory and last-level cache access latencies. While out-of-order (OoO) cores, and techniques building on them, are effective at exp ...
{"title":"Freeway: Maximizing MLP for Slice-Out-of-Order Execution","authors":"Rakesh Kumar, M. Alipour, D. Black-Schaffer","doi":"10.1109/HPCA.2019.00009","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00009","url":null,"abstract":"Exploiting memory level parallelism (MLP) is crucial to hide long memory and last level cache access latencies. While out-of-order (OoO) cores, and techniques building on them, are effective at exp ...","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120862627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Conditional Speculation: An Effective Approach to Safeguard Out-of-Order Execution Against Spectre Attacks," Peinan Li, Lutan Zhao, Rui Hou, Lixin Zhang, Dan Meng. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). DOI: 10.1109/HPCA.2019.00043.
Speculative execution side-channel vulnerabilities such as Spectre reveal that conventional architecture designs lack security considerations. This paper proposes a software-transparent defense mechanism, named Conditional Speculation, against Spectre vulnerabilities on traditional out-of-order microprocessors. It introduces the concept of security dependence to mark speculative memory instructions that could leak information with potential security risk. More specifically, security-dependent instructions are detected and marked with suspect-speculation flags in the Issue Queue. All instructions can be speculatively issued for execution in accordance with the classic out-of-order pipeline. Instructions carrying suspect-speculation flags are considered safe if their speculative execution will not refill new cache lines with unauthorized privileged data; otherwise, they are considered unsafe and are not allowed to execute speculatively. To reduce the performance impact of not executing unsafe instructions speculatively, we investigate two filtering mechanisms, a Cache-hit-based Hazard Filter and a Trusted-Page-Buffer-based Hazard Filter, to filter out false security hazards. Our design philosophy is to speculatively execute safe instructions to maintain the performance benefits of out-of-order execution, while blocking the speculative execution of unsafe instructions for security. We evaluate Conditional Speculation in terms of performance, security, and area. The experimental results show that the hardware overhead is marginal and the performance overhead is minimal. Keywords: Spectre vulnerability defense; security dependence; speculative execution side-channel vulnerabilities.
{"title":"Conditional Speculation: An Effective Approach to Safeguard Out-of-Order Execution Against Spectre Attacks","authors":"Peinan Li, Lutan Zhao, Rui Hou, Lixin Zhang, Dan Meng","doi":"10.1109/HPCA.2019.00043","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00043","url":null,"abstract":"Speculative execution side-channel vulnerabilities such as Spectre reveal that conventional architecture designs lack security consideration. This paper proposes a software transparent defense mechanism, named as Conditional Speculation, against Spectre vulnerabilities found on traditional out-of-order microprocessors. It introduces the concept of security dependence to mark speculative memory instructions which could leak information with potential security risk. More specifically, security-dependent instructions are detected and marked with suspect speculation flags in the Issue Queue. All the instructions can be speculatively issued for execution in accordance with the classic out-of-order pipeline. For those instructions with suspect speculation flags, they are considered as safe instructions if their speculative execution will not refill new cache lines with unauthorized privilege data. Otherwise, they are considered as unsafe instructions and thus not allowed to execute speculatively. To reduce the performance impact from not executing unsafe instructions speculatively, we investigate two filtering mechanisms, Cachehit based Hazard Filter and Trusted Page Buffer based Hazard Filter to filter out false security hazards. Our design philosophy is to speculatively execute safe instructions to maintain the performance benefits of out-of-order execution while blocking the speculative execution of unsafe instructions for security consideration. We evaluate Conditional Speculation in terms of performance, security and area. The experimental results show that the hardware overhead is marginal and the performance overhead is minimal. Keywords-Spectre vulnerabilities defense; Security dependence; Speculative execution side-channel vulnerabilities;","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123077421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"A Hybrid Framework for Fast and Accurate GPU Performance Estimation through Source-Level Analysis and Trace-Based Simulation," Xiebing Wang, Kai Huang, A. Knoll, Xuehai Qian. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). DOI: 10.1109/HPCA.2019.00062.
This paper proposes a hybrid framework for fast and accurate performance estimation of OpenCL kernels running on GPUs. The kernel execution flow is statically analyzed, and the execution trace is then generated via a loop-based bidirectional branch search. The trace is dynamically simulated to perform a dummy execution of the kernel and obtain the estimated time. The framework does not rely on the profiling or measurement results used in conventional performance estimation techniques. Moreover, the lightweight trace-based simulation consumes much less time than a fine-grained GPU simulator. Our framework accurately captures the variation trend of execution time in the design space and robustly predicts kernel performance across two generations of recent Nvidia GPU architectures. Experiments on four Commercial Off-The-Shelf (COTS) GPUs show that our framework predicts runtime performance with an average Mean Absolute Percentage Error (MAPE) of 17.04% while taking only a few seconds. We also demonstrate the practicability of our framework with a real-world application.
{"title":"A Hybrid Framework for Fast and Accurate GPU Performance Estimation through Source-Level Analysis and Trace-Based Simulation","authors":"Xiebing Wang, Kai Huang, A. Knoll, Xuehai Qian","doi":"10.1109/HPCA.2019.00062","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00062","url":null,"abstract":"This paper proposes a hybrid framework for fast and accurate performance estimation of OpenCL kernels running on GPUs. The kernel execution flow is statically analyzed and thereupon the execution trace is generated via a loop-based bidirectional branch search. Then the trace is dynamically simulated to perform a dummy execution of the kernel to obtain the estimated time. The framework does not rely on profiling or measurement results which are used in conventional performance estimation techniques. Moreover, the lightweight trace-based simulation consumes much less time than a fine-grained GPU simulator. Our framework can accurately grasp the variation trend of the execution time in the design space and robustly predict the performance of the kernels across two generations of recent Nvidia GPU architectures. Experiments on four Commercial Off-The-Shelf (COTS) GPUs show that our framework can predict the runtime performance with average Mean Absolute Percentage Error (MAPE) of 17.04% and time consumption of a few seconds. We also demonstrate the practicability of our framework with a realworld application.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128159014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Poly: Efficient Heterogeneous System and Application Management for Interactive Applications," Shuo Wang, Yun Liang, Wei Zhang. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). DOI: 10.1109/HPCA.2019.00038.
QoS-sensitive workloads, common in warehouse-scale datacenters, require a guaranteed, stable tail latency (a percentile of the service's response latency). Unfortunately, the system load (e.g., requests per second, RPS) fluctuates drastically during daily datacenter operation. To meet the maximum system RPS requirement, datacenters tend to overprovision hardware accelerators, which leaves the datacenter underutilized. Therefore, throughput and energy-efficiency scaling of current accelerator-outfitted datacenters is very expensive for QoS-sensitive workloads. To overcome this challenge, this work introduces Poly, an OpenCL-based heterogeneous system optimization framework that aims to improve overall throughput scalability and energy proportionality, while guaranteeing QoS, by efficiently utilizing GPU- and FPGA-based accelerators within the datacenter. Poly is composed of two phases. At compile time, Poly automatically captures the parallel patterns in applications and explores a comprehensive design space within and across parallel patterns. At runtime, Poly relies on a runtime kernel scheduler to judiciously make scheduling decisions that accommodate the dynamic latency and throughput requirements. Experiments using a variety of cloud QoS-sensitive applications show that Poly improves energy proportionality by 23% (17%) over the state-of-the-art GPU (FPGA) solution without sacrificing QoS. Keywords: heterogeneous systems; GPU; FPGA; performance optimization.
{"title":"Poly: Efficient Heterogeneous System and Application Management for Interactive Applications","authors":"Shuo Wang, Yun Liang, Wei Zhang","doi":"10.1109/HPCA.2019.00038","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00038","url":null,"abstract":"QoS-sensitive workloads, common in warehousescale datacenters, require a guaranteed stable tail latency percentile response latency) of the service. Unfortunately, the system load (e.g., RPS) fluctuates drastically during daily datacenter operations. In order to meet the maximum system RPS requirement, datacenter tends to overprovision the hardware accelerators, which makes the datacenter underutilized.Therefore, the throughput and energy efficiency scaling of the current accelerator-outfitted datacenter are very expensive for QoS-sensitive workloads. To overcome this challenge, this work introduces Poly, an OpenCL based heterogeneous system optimization framework that targets to improve the overall throughput scalability and energy proportionality while guaranteeing the QoS by efficiently utilizing GPUs and FPGAs based accelerators within datacenter. Poly is mainly composed of two phases. At compile-time, Poly automatically captures the parallel patterns in the applications and explores a comprehensive design space within and across parallel patterns. At runtime, Poly relies on a runtime kernel scheduler to judiciously make the scheduling decisions to accommodate the dynamic latency and throughput requirements. Experiments using a variety of cloud QoS-sensitive applications show that Poly improves the energy proportionality by 23%(17%) without sacrificing the QoS compared to the state-of-the-art GPU (FPGA) solution, respectively. Keywords-Heterogeneous; GPU; FPGA; Performance Optimization;","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127512748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Understanding the Impact of Socket Density in Density Optimized Servers," Manish Arora, Matt Skach, Wei Huang, Xudong An, Jason Mars, Lingjia Tang, D. Tullsen. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). DOI: 10.1109/HPCA.2019.00066.
The increasing demand for computational power has led to the creation and deployment of large-scale data centers. During the last few years, data centers have seen improvements aimed at increasing computational density – the amount of throughput that can be achieved within the allocated physical footprint. This need to pack more compute into the same physical space has led to density-optimized server designs. Density-optimized servers push compute density significantly beyond what blade servers can achieve by using innovative modular chassis-based designs. This paper presents a comprehensive analysis of the impact of socket density on intra-server thermals and demonstrates that increased socket density inside the server leads to large temperature variations among sockets due to inter-socket thermal coupling. The paper shows that traditional chip-level and data-center-level temperature-aware scheduling techniques do not work well for thermally coupled sockets. It then proposes new scheduling techniques that account for the thermals of the socket a task is scheduled on, as well as of thermally coupled nearby sockets. The proposed mechanisms provide 2.5% to 6.5% performance improvements across various workloads, and as much as 17% over traditional temperature-aware schedulers for computation-heavy workloads. Keywords: servers; data centers; density-optimized servers; scheduling.
{"title":"Understanding the Impact of Socket Density in Density Optimized Servers","authors":"Manish Arora, Matt Skach, Wei Huang, Xudong An, Jason Mars, Lingjia Tang, D. Tullsen","doi":"10.1109/HPCA.2019.00066","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00066","url":null,"abstract":"The increasing demand for computational power has led to the creation and deployment of large-scale data centers. During the last few years, data centers have seen improvements aimed at increasing computational density – the amount of throughput that can be achieved within the allocated physical footprint. This need to pack more compute in the same physical space has led to density optimized server designs. Density optimized servers push compute density significantly beyond what can be achieved by blade servers by using innovative modular chassis based designs. This paper presents a comprehensive analysis of the impact of socket density on intra-server thermals and demonstrates that increased socket density inside the server leads to large temperature variations among sockets due to inter-socket thermal coupling. The paper shows that traditional chip-level and data center-level temperature-aware scheduling techniques do not work well for thermally-coupled sockets. The paper proposes new scheduling techniques that account for the thermals of the socket a task is scheduled on, as well as thermally coupled nearby sockets. The proposed mechanisms provide 2.5% to 6.5% performance improvements across various workloads and as much as 17% over traditional temperature-aware schedulers for computation-heavy workloads. Keywords-Server; Data center; Density Optimized Server; Scheduling","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"129 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132293164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Kelp: QoS for Accelerated Machine Learning Systems," Haishan Zhu, David Lo, Liqun Cheng, R. Govindaraju, Parthasarathy Ranganathan, M. Erez. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). DOI: 10.1109/HPCA.2019.00036.
Development and deployment of machine learning (ML) accelerators in Warehouse Scale Computers (WSCs) demand significant capital investments and engineering effort. However, even though heavy computation can be offloaded to the accelerators, applications often depend on the host system for various supporting tasks. As a result, contention on host resources, such as memory bandwidth, can significantly discount the performance and efficiency gains of accelerators. The impact of performance interference is further amplified in distributed learning, which has become increasingly common as model sizes continue to grow. In this work, we study the performance of four production machine learning workloads on three accelerator platforms. Our experiments show that these workloads are highly sensitive to host memory bandwidth contention, which can cause 40% average performance degradation when left unmanaged. To tackle this problem, we design and implement Kelp, a software runtime that isolates high-priority accelerated ML tasks from memory-resource interference. We evaluate Kelp with both production and artificial aggressor workloads, and compare its effectiveness with previously proposed solutions. Our evaluation shows that Kelp is effective in mitigating performance degradation of the accelerated tasks, and improves performance by 24% on average. Compared to previous work, Kelp reduces performance degradation of ML tasks by 7% and improves system efficiency by 17%. Our results further expose opportunities in future architecture designs.
{"title":"Kelp: QoS for Accelerated Machine Learning Systems","authors":"Haishan Zhu, David Lo, Liqun Cheng, R. Govindaraju, Parthasarathy Ranganathan, M. Erez","doi":"10.1109/HPCA.2019.00036","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00036","url":null,"abstract":"Development and deployment of machine learning (ML) accelerators in Warehouse Scale Computers (WSCs) demand significant capital investments and engineering efforts. However, even though heavy computation can be offloaded to the accelerators, applications often depend on the host system for various supporting tasks. As a result, contention on host resources, such as memory bandwidth, can significantly discount the performance and efficiency gains of accelerators. The impact of performance interference is further amplified in distributed learning, which has become increasingly common as model sizes continue to grow. In this work, we study the performance of four production machine learning workloads on three accelerator platforms. Our experiments show that these workloads are highly sensitive to host memory bandwidth contention, which can cause 40% average performance degradation when left unmanaged. To tackle this problem, we design and implement Kelp, a software runtime that isolates high priority accelerated ML tasks from memory resource interference. We evaluate Kelp with both production and artificial aggressor workloads, and compare its effectiveness with previously proposed solutions. Our evaluation shows that Kelp is effective in mitigating performance degradation of the accelerated tasks, and improves performance by 24% on average. Compared to previous work, Kelp reduces performance degradation of ML tasks by 7% and improves system efficiency by 17%. Our results further expose opportunities in future architecture designs.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128606134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Reliability Evaluation of Mixed-Precision Architectures," F. Santos, Caio B. Lunardi, Daniel Oliveira, F. Libano, P. Rech. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). DOI: 10.1109/HPCA.2019.00041.
{"title":"Reliability Evaluation of Mixed-Precision Architectures","authors":"F. Santos, Caio B. Lunardi, Daniel Oliveira, F. Libano, P. Rech","doi":"10.1109/HPCA.2019.00041","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00041","url":null,"abstract":"","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"179 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133871939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}