"FUSE: Fusing STT-MRAM into GPUs to Alleviate Off-Chip Memory Access Overheads," Jie Zhang, Myoungsoo Jung, M. Kandemir. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). DOI: 10.1109/HPCA.2019.00055.
In this work, we propose FUSE, a novel GPU cache system that integrates spin-transfer torque magnetic random-access memory (STT-MRAM) into the on-chip L1D cache. FUSE minimizes the number of outgoing memory accesses over the interconnection network of the GPU's multiprocessors, which in turn considerably improves the level of massive computing parallelism in GPUs. Specifically, FUSE predicts the read level of GPU memory accesses by extracting GPU runtime information and places write-once-read-multiple (WORM) data blocks into the STT-MRAM, while accommodating write-multiple data blocks in a small SRAM portion of the L1D cache. To further reduce off-chip memory accesses, FUSE also allows WORM data blocks to be allocated anywhere in the STT-MRAM by approximating full associativity with a limited number of tag comparators and I/O peripherals. Our evaluation results show that, compared to a traditional GPU cache, our proposed heterogeneous cache reduces the number of outgoing memory references across the interconnection network by 32%, thereby improving overall performance by 217% and reducing energy cost by 53%.
{"title":"FUSE: Fusing STT-MRAM into GPUs to Alleviate Off-Chip Memory Access Overheads","authors":"Jie Zhang, Myoungsoo Jung, M. Kandemir","doi":"10.1109/HPCA.2019.00055","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00055","url":null,"abstract":"In this work, we propose FUSE, a novel GPU cache system that integrates spin-transfer torque magnetic random-access memory (STT-MRAM) into the on-chip L1D cache. FUSE can minimize the number of outgoing memory accesses over the interconnection network of GPU's multiprocessors, which in turn can considerably improve the level of massive computing parallelism in GPUs. Specifically, FUSE predicts a read-level of GPU memory accesses by extracting GPU runtime information and places write-once-read-multiple (WORM) data blocks into the STT-MRAM, while accommodating write-multiple data blocks over a small portion of SRAM in the L1D cache. To further reduce the off-chip memory accesses, FUSE also allows WORM data blocks to be allocated anywhere in the STT-MRAM by approximating the associativity with the limited number of tag comparators and I/O peripherals. Our evaluation results show that, in comparison to a traditional GPU cache, our proposed heterogeneous cache reduces the number of outgoing memory references by 32% across the interconnection network, thereby improving the overall performance by 217% and reducing energy cost by 53%.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123541233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enabling Transparent Memory-Compression for Commodity Memory Systems","authors":"Vinson Young, S. Kariyappa, Moinuddin K. Qureshi","doi":"10.1109/HPCA.2019.00010","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00010","url":null,"abstract":"","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123630728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Freeway: Maximizing MLP for Slice-Out-of-Order Execution," Rakesh Kumar, M. Alipour, D. Black-Schaffer. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). DOI: 10.1109/HPCA.2019.00009.
Exploiting memory-level parallelism (MLP) is crucial to hide long memory and last-level cache access latencies. While out-of-order (OoO) cores, and techniques building on them, are effective at exp ...
{"title":"Freeway: Maximizing MLP for Slice-Out-of-Order Execution","authors":"Rakesh Kumar, M. Alipour, D. Black-Schaffer","doi":"10.1109/HPCA.2019.00009","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00009","url":null,"abstract":"Exploiting memory level parallelism (MLP) is crucial to hide long memory and last level cache access latencies. While out-of-order (OoO) cores, and techniques building on them, are effective at exp ...","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120862627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Conditional Speculation: An Effective Approach to Safeguard Out-of-Order Execution Against Spectre Attacks," Peinan Li, Lutan Zhao, Rui Hou, Lixin Zhang, Dan Meng. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). DOI: 10.1109/HPCA.2019.00043.
Speculative execution side-channel vulnerabilities such as Spectre reveal that conventional architecture designs lack security considerations. This paper proposes a software-transparent defense mechanism, named Conditional Speculation, against Spectre vulnerabilities on traditional out-of-order microprocessors. It introduces the concept of security dependence to mark speculative memory instructions that could leak information with potential security risk. More specifically, security-dependent instructions are detected and marked with suspect-speculation flags in the Issue Queue. All instructions can be speculatively issued for execution in accordance with the classic out-of-order pipeline. Instructions carrying suspect-speculation flags are considered safe if their speculative execution will not refill new cache lines with unauthorized privileged data; otherwise, they are considered unsafe and are not allowed to execute speculatively. To reduce the performance impact of not executing unsafe instructions speculatively, we investigate two filtering mechanisms, a Cache-hit-based Hazard Filter and a Trusted-Page-Buffer-based Hazard Filter, to filter out false security hazards. Our design philosophy is to speculatively execute safe instructions to maintain the performance benefits of out-of-order execution, while blocking the speculative execution of unsafe instructions for security. We evaluate Conditional Speculation in terms of performance, security, and area. The experimental results show that the hardware overhead is marginal and the performance overhead is minimal. Keywords: Spectre vulnerability defense; security dependence; speculative execution side-channel vulnerabilities.
{"title":"Conditional Speculation: An Effective Approach to Safeguard Out-of-Order Execution Against Spectre Attacks","authors":"Peinan Li, Lutan Zhao, Rui Hou, Lixin Zhang, Dan Meng","doi":"10.1109/HPCA.2019.00043","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00043","url":null,"abstract":"Speculative execution side-channel vulnerabilities such as Spectre reveal that conventional architecture designs lack security consideration. This paper proposes a software transparent defense mechanism, named as Conditional Speculation, against Spectre vulnerabilities found on traditional out-of-order microprocessors. It introduces the concept of security dependence to mark speculative memory instructions which could leak information with potential security risk. More specifically, security-dependent instructions are detected and marked with suspect speculation flags in the Issue Queue. All the instructions can be speculatively issued for execution in accordance with the classic out-of-order pipeline. For those instructions with suspect speculation flags, they are considered as safe instructions if their speculative execution will not refill new cache lines with unauthorized privilege data. Otherwise, they are considered as unsafe instructions and thus not allowed to execute speculatively. To reduce the performance impact from not executing unsafe instructions speculatively, we investigate two filtering mechanisms, Cachehit based Hazard Filter and Trusted Page Buffer based Hazard Filter to filter out false security hazards. Our design philosophy is to speculatively execute safe instructions to maintain the performance benefits of out-of-order execution while blocking the speculative execution of unsafe instructions for security consideration. We evaluate Conditional Speculation in terms of performance, security and area. The experimental results show that the hardware overhead is marginal and the performance overhead is minimal. Keywords-Spectre vulnerabilities defense; Security dependence; Speculative execution side-channel vulnerabilities;","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123077421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"A Hybrid Framework for Fast and Accurate GPU Performance Estimation through Source-Level Analysis and Trace-Based Simulation," Xiebing Wang, Kai Huang, A. Knoll, Xuehai Qian. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). DOI: 10.1109/HPCA.2019.00062.
This paper proposes a hybrid framework for fast and accurate performance estimation of OpenCL kernels running on GPUs. The kernel execution flow is statically analyzed, and the execution trace is then generated via a loop-based bidirectional branch search. The trace is dynamically simulated to perform a dummy execution of the kernel and obtain the estimated time. The framework does not rely on the profiling or measurement results used in conventional performance estimation techniques. Moreover, the lightweight trace-based simulation consumes much less time than a fine-grained GPU simulator. Our framework accurately captures the variation trend of execution time in the design space and robustly predicts kernel performance across two generations of recent Nvidia GPU architectures. Experiments on four Commercial Off-The-Shelf (COTS) GPUs show that our framework predicts runtime performance with an average Mean Absolute Percentage Error (MAPE) of 17.04% while taking only a few seconds. We also demonstrate the practicability of our framework with a real-world application.
{"title":"A Hybrid Framework for Fast and Accurate GPU Performance Estimation through Source-Level Analysis and Trace-Based Simulation","authors":"Xiebing Wang, Kai Huang, A. Knoll, Xuehai Qian","doi":"10.1109/HPCA.2019.00062","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00062","url":null,"abstract":"This paper proposes a hybrid framework for fast and accurate performance estimation of OpenCL kernels running on GPUs. The kernel execution flow is statically analyzed and thereupon the execution trace is generated via a loop-based bidirectional branch search. Then the trace is dynamically simulated to perform a dummy execution of the kernel to obtain the estimated time. The framework does not rely on profiling or measurement results which are used in conventional performance estimation techniques. Moreover, the lightweight trace-based simulation consumes much less time than a fine-grained GPU simulator. Our framework can accurately grasp the variation trend of the execution time in the design space and robustly predict the performance of the kernels across two generations of recent Nvidia GPU architectures. Experiments on four Commercial Off-The-Shelf (COTS) GPUs show that our framework can predict the runtime performance with average Mean Absolute Percentage Error (MAPE) of 17.04% and time consumption of a few seconds. We also demonstrate the practicability of our framework with a realworld application.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128159014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Poly: Efficient Heterogeneous System and Application Management for Interactive Applications," Shuo Wang, Yun Liang, Wei Zhang. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). DOI: 10.1109/HPCA.2019.00038.
QoS-sensitive workloads, common in warehouse-scale datacenters, require a guaranteed, stable tail latency (a percentile of the service's response latency). Unfortunately, the system load (e.g., requests per second, RPS) fluctuates drastically during daily datacenter operation. To meet the maximum system RPS requirement, datacenters tend to overprovision hardware accelerators, which leaves the datacenter underutilized. Therefore, throughput and energy-efficiency scaling of current accelerator-outfitted datacenters is very expensive for QoS-sensitive workloads. To overcome this challenge, this work introduces Poly, an OpenCL-based heterogeneous system optimization framework that aims to improve overall throughput scalability and energy proportionality, while guaranteeing QoS, by efficiently utilizing GPU- and FPGA-based accelerators within the datacenter. Poly is composed of two phases. At compile time, Poly automatically captures the parallel patterns in applications and explores a comprehensive design space within and across parallel patterns. At runtime, Poly relies on a runtime kernel scheduler to judiciously make scheduling decisions that accommodate the dynamic latency and throughput requirements. Experiments using a variety of cloud QoS-sensitive applications show that Poly improves energy proportionality by 23% (17%) over the state-of-the-art GPU (FPGA) solution without sacrificing QoS. Keywords: heterogeneous systems; GPU; FPGA; performance optimization.
{"title":"Poly: Efficient Heterogeneous System and Application Management for Interactive Applications","authors":"Shuo Wang, Yun Liang, Wei Zhang","doi":"10.1109/HPCA.2019.00038","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00038","url":null,"abstract":"QoS-sensitive workloads, common in warehousescale datacenters, require a guaranteed stable tail latency percentile response latency) of the service. Unfortunately, the system load (e.g., RPS) fluctuates drastically during daily datacenter operations. In order to meet the maximum system RPS requirement, datacenter tends to overprovision the hardware accelerators, which makes the datacenter underutilized.Therefore, the throughput and energy efficiency scaling of the current accelerator-outfitted datacenter are very expensive for QoS-sensitive workloads. To overcome this challenge, this work introduces Poly, an OpenCL based heterogeneous system optimization framework that targets to improve the overall throughput scalability and energy proportionality while guaranteeing the QoS by efficiently utilizing GPUs and FPGAs based accelerators within datacenter. Poly is mainly composed of two phases. At compile-time, Poly automatically captures the parallel patterns in the applications and explores a comprehensive design space within and across parallel patterns. At runtime, Poly relies on a runtime kernel scheduler to judiciously make the scheduling decisions to accommodate the dynamic latency and throughput requirements. Experiments using a variety of cloud QoS-sensitive applications show that Poly improves the energy proportionality by 23%(17%) without sacrificing the QoS compared to the state-of-the-art GPU (FPGA) solution, respectively. Keywords-Heterogeneous; GPU; FPGA; Performance Optimization;","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127512748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Understanding the Impact of Socket Density in Density Optimized Servers," Manish Arora, Matt Skach, Wei Huang, Xudong An, Jason Mars, Lingjia Tang, D. Tullsen. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). DOI: 10.1109/HPCA.2019.00066.
The increasing demand for computational power has led to the creation and deployment of large-scale data centers. During the last few years, data centers have seen improvements aimed at increasing computational density – the amount of throughput that can be achieved within the allocated physical footprint. This need to pack more compute into the same physical space has led to density-optimized server designs. Density-optimized servers push compute density significantly beyond what blade servers can achieve by using innovative modular chassis-based designs. This paper presents a comprehensive analysis of the impact of socket density on intra-server thermals and demonstrates that increased socket density inside the server leads to large temperature variations among sockets due to inter-socket thermal coupling. The paper shows that traditional chip-level and data-center-level temperature-aware scheduling techniques do not work well for thermally coupled sockets. It then proposes new scheduling techniques that account for the thermals of the socket a task is scheduled on, as well as of thermally coupled nearby sockets. The proposed mechanisms provide 2.5% to 6.5% performance improvements across various workloads, and as much as 17% over traditional temperature-aware schedulers for computation-heavy workloads. Keywords: servers; data centers; density-optimized servers; scheduling.
{"title":"Understanding the Impact of Socket Density in Density Optimized Servers","authors":"Manish Arora, Matt Skach, Wei Huang, Xudong An, Jason Mars, Lingjia Tang, D. Tullsen","doi":"10.1109/HPCA.2019.00066","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00066","url":null,"abstract":"The increasing demand for computational power has led to the creation and deployment of large-scale data centers. During the last few years, data centers have seen improvements aimed at increasing computational density – the amount of throughput that can be achieved within the allocated physical footprint. This need to pack more compute in the same physical space has led to density optimized server designs. Density optimized servers push compute density significantly beyond what can be achieved by blade servers by using innovative modular chassis based designs. This paper presents a comprehensive analysis of the impact of socket density on intra-server thermals and demonstrates that increased socket density inside the server leads to large temperature variations among sockets due to inter-socket thermal coupling. The paper shows that traditional chip-level and data center-level temperature-aware scheduling techniques do not work well for thermally-coupled sockets. The paper proposes new scheduling techniques that account for the thermals of the socket a task is scheduled on, as well as thermally coupled nearby sockets. The proposed mechanisms provide 2.5% to 6.5% performance improvements across various workloads and as much as 17% over traditional temperature-aware schedulers for computation-heavy workloads. Keywords-Server; Data center; Density Optimized Server; Scheduling","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"129 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132293164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Kelp: QoS for Accelerated Machine Learning Systems," Haishan Zhu, David Lo, Liqun Cheng, R. Govindaraju, Parthasarathy Ranganathan, M. Erez. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). DOI: 10.1109/HPCA.2019.00036.
Development and deployment of machine learning (ML) accelerators in Warehouse Scale Computers (WSCs) demand significant capital investments and engineering effort. However, even though heavy computation can be offloaded to the accelerators, applications often depend on the host system for various supporting tasks. As a result, contention on host resources, such as memory bandwidth, can significantly discount the performance and efficiency gains of accelerators. The impact of performance interference is further amplified in distributed learning, which has become increasingly common as model sizes continue to grow. In this work, we study the performance of four production machine learning workloads on three accelerator platforms. Our experiments show that these workloads are highly sensitive to host memory bandwidth contention, which can cause 40% average performance degradation when left unmanaged. To tackle this problem, we design and implement Kelp, a software runtime that isolates high-priority accelerated ML tasks from memory-resource interference. We evaluate Kelp with both production and artificial aggressor workloads, and compare its effectiveness with previously proposed solutions. Our evaluation shows that Kelp is effective in mitigating performance degradation of the accelerated tasks, and improves performance by 24% on average. Compared to previous work, Kelp reduces performance degradation of ML tasks by 7% and improves system efficiency by 17%. Our results further expose opportunities in future architecture designs.
{"title":"Kelp: QoS for Accelerated Machine Learning Systems","authors":"Haishan Zhu, David Lo, Liqun Cheng, R. Govindaraju, Parthasarathy Ranganathan, M. Erez","doi":"10.1109/HPCA.2019.00036","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00036","url":null,"abstract":"Development and deployment of machine learning (ML) accelerators in Warehouse Scale Computers (WSCs) demand significant capital investments and engineering efforts. However, even though heavy computation can be offloaded to the accelerators, applications often depend on the host system for various supporting tasks. As a result, contention on host resources, such as memory bandwidth, can significantly discount the performance and efficiency gains of accelerators. The impact of performance interference is further amplified in distributed learning, which has become increasingly common as model sizes continue to grow. In this work, we study the performance of four production machine learning workloads on three accelerator platforms. Our experiments show that these workloads are highly sensitive to host memory bandwidth contention, which can cause 40% average performance degradation when left unmanaged. To tackle this problem, we design and implement Kelp, a software runtime that isolates high priority accelerated ML tasks from memory resource interference. We evaluate Kelp with both production and artificial aggressor workloads, and compare its effectiveness with previously proposed solutions. Our evaluation shows that Kelp is effective in mitigating performance degradation of the accelerated tasks, and improves performance by 24% on average. Compared to previous work, Kelp reduces performance degradation of ML tasks by 7% and improves system efficiency by 17%. Our results further expose opportunities in future architecture designs.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128606134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Reliability Evaluation of Mixed-Precision Architectures," F. Santos, Caio B. Lunardi, Daniel Oliveira, F. Libano, P. Rech. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). DOI: 10.1109/HPCA.2019.00041.
{"title":"Reliability Evaluation of Mixed-Precision Architectures","authors":"F. Santos, Caio B. Lunardi, Daniel Oliveira, F. Libano, P. Rech","doi":"10.1109/HPCA.2019.00041","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00041","url":null,"abstract":"","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"179 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133871939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}