高带宽互连中CUDA通信原语的特性评估

Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering Pub Date : 2019-04-04 DOI:10.1145/3297663.3310299

Carl Pearson, Abdul Dakkak, Sarah Hashash, Cheng Li, I. Chung, Jinjun Xiong, Wen-mei W. Hwu

{"title":"高带宽互连中CUDA通信原语的特性评估","authors":"Carl Pearson, Abdul Dakkak, Sarah Hashash, Cheng Li, I. Chung, Jinjun Xiong, Wen-mei W. Hwu","doi":"10.1145/3297663.3310299","DOIUrl":null,"url":null,"abstract":"Data-intensive applications such as machine learning and analytics have created a demand for faster interconnects to avert the memory bandwidth wall and allow GPUs to be effectively leveraged for lower compute intensity tasks. This has resulted in wide adoption of heterogeneous systems with varying underlying interconnects, and has delegated the task of understanding and copying data to the system or application developer. No longer is a malloc followed by memcpy the only or dominating modality of data transfer; application developers are faced with additional options such as unified memory and zero-copy memory. Data transfer performance on these systems is now impacted by many factors including data transfer modality, system interconnect hardware details, CPU caching state, CPU power management state, driver policies, virtual memory paging efficiency, and data placement. This paper presents Comm|Scope, a set of microbenchmarks designed for system and application developers to understand memory transfer behavior across different data placement and exchange scenarios. Comm|Scope comprehensively measures the latency and bandwidth of CUDA data transfer primitives, and avoids common pitfalls in ad-hoc measurements by controlling CPU caches, clock frequencies, and avoids measuring synchronization costs imposed by the measurement methodology where possible. This paper also presents an evaluation of Comm|Scope on systems featuring the POWER and x86 CPU architectures and PCIe 3, NVLink 1, and NVLink 2 interconnects. These systems are chosen as representative configurations of current high-performance GPU platforms. Comm|Scope measurements can serve to update insights about the relative performance of data transfer methods on current systems. This work also reports insights for how high-level system design choices affect the performance of these data transfers, and how developers can optimize applications on these systems.","PeriodicalId":273447,"journal":{"name":"Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering","volume":"56 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"18","resultStr":"{\"title\":\"Evaluating Characteristics of CUDA Communication Primitives on High-Bandwidth Interconnects\",\"authors\":\"Carl Pearson, Abdul Dakkak, Sarah Hashash, Cheng Li, I. Chung, Jinjun Xiong, Wen-mei W. Hwu\",\"doi\":\"10.1145/3297663.3310299\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Data-intensive applications such as machine learning and analytics have created a demand for faster interconnects to avert the memory bandwidth wall and allow GPUs to be effectively leveraged for lower compute intensity tasks. This has resulted in wide adoption of heterogeneous systems with varying underlying interconnects, and has delegated the task of understanding and copying data to the system or application developer. No longer is a malloc followed by memcpy the only or dominating modality of data transfer; application developers are faced with additional options such as unified memory and zero-copy memory. Data transfer performance on these systems is now impacted by many factors including data transfer modality, system interconnect hardware details, CPU caching state, CPU power management state, driver policies, virtual memory paging efficiency, and data placement. This paper presents Comm|Scope, a set of microbenchmarks designed for system and application developers to understand memory transfer behavior across different data placement and exchange scenarios. Comm|Scope comprehensively measures the latency and bandwidth of CUDA data transfer primitives, and avoids common pitfalls in ad-hoc measurements by controlling CPU caches, clock frequencies, and avoids measuring synchronization costs imposed by the measurement methodology where possible. This paper also presents an evaluation of Comm|Scope on systems featuring the POWER and x86 CPU architectures and PCIe 3, NVLink 1, and NVLink 2 interconnects. These systems are chosen as representative configurations of current high-performance GPU platforms. Comm|Scope measurements can serve to update insights about the relative performance of data transfer methods on current systems. This work also reports insights for how high-level system design choices affect the performance of these data transfers, and how developers can optimize applications on these systems.\",\"PeriodicalId\":273447,\"journal\":{\"name\":\"Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering\",\"volume\":\"56 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-04-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"18\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3297663.3310299\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3297663.3310299","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 18

摘要

数据密集型应用(如机器学习和分析)产生了对更快互连的需求，以避免内存带宽墙，并允许gpu有效地用于较低计算强度的任务。这导致广泛采用具有不同底层互连的异构系统，并将理解和复制数据的任务委托给系统或应用程序开发人员。malloc和memcpy不再是唯一或主要的数据传输方式;应用程序开发人员面临着其他选项，如统一内存和零复制内存。这些系统上的数据传输性能现在受到许多因素的影响，包括数据传输方式、系统互连硬件细节、CPU缓存状态、CPU电源管理状态、驱动程序策略、虚拟内存分页效率和数据放置。本文介绍了Comm Scope，这是一组为系统和应用程序开发人员设计的微基准测试，用于了解不同数据放置和交换场景中的内存传输行为。Comm|Scope全面测量CUDA数据传输原语的延迟和带宽，并通过控制CPU缓存，时钟频率避免了ad-hoc测量中的常见陷阱，并避免了测量方法在可能的情况下测量同步成本。本文还介绍了Comm|Scope在具有POWER和x86 CPU架构以及PCIe 3、NVLink 1和NVLink 2互连的系统上的评估。选择这些系统作为当前高性能GPU平台的代表性配置。Comm Scope测量可以用于更新当前系统中数据传输方法的相对性能的见解。这项工作还报告了高级系统设计选择如何影响这些数据传输的性能，以及开发人员如何在这些系统上优化应用程序的见解。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Evaluating Characteristics of CUDA Communication Primitives on High-Bandwidth Interconnects

Data-intensive applications such as machine learning and analytics have created a demand for faster interconnects to avert the memory bandwidth wall and allow GPUs to be effectively leveraged for lower compute intensity tasks. This has resulted in wide adoption of heterogeneous systems with varying underlying interconnects, and has delegated the task of understanding and copying data to the system or application developer. No longer is a malloc followed by memcpy the only or dominating modality of data transfer; application developers are faced with additional options such as unified memory and zero-copy memory. Data transfer performance on these systems is now impacted by many factors including data transfer modality, system interconnect hardware details, CPU caching state, CPU power management state, driver policies, virtual memory paging efficiency, and data placement. This paper presents Comm|Scope, a set of microbenchmarks designed for system and application developers to understand memory transfer behavior across different data placement and exchange scenarios. Comm|Scope comprehensively measures the latency and bandwidth of CUDA data transfer primitives, and avoids common pitfalls in ad-hoc measurements by controlling CPU caches, clock frequencies, and avoids measuring synchronization costs imposed by the measurement methodology where possible. This paper also presents an evaluation of Comm|Scope on systems featuring the POWER and x86 CPU architectures and PCIe 3, NVLink 1, and NVLink 2 interconnects. These systems are chosen as representative configurations of current high-performance GPU platforms. Comm|Scope measurements can serve to update insights about the relative performance of data transfer methods on current systems. This work also reports insights for how high-level system design choices affect the performance of these data transfers, and how developers can optimize applications on these systems.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering

自引率

0.00%

发文量

期刊最新文献

Performance Evaluation of Multi-Path TCP for Data Center and Cloud Workloads Cachematic - Automatic Invalidation in Application-Level Caching Systems Memory Centric Characterization and Analysis of SPEC CPU2017 Suite Evaluating Characteristics of CUDA Communication Primitives on High-Bandwidth Interconnects Yardstick: A Benchmark for Minecraft-like Services