Reducing overhead in the Uintah framework to support short-lived tasks on GPU-heterogeneous architectures

高性能计算技术 (High Performance Computing Technology) Pub Date: 2015-11-15 DOI: 10.1145/2830018.2830023

B. Peterson, H. Dasari, A. Humphrey, J. Sutherland, T. Saad, M. Berzins


Citations: 9


Reducing overhead in the Uintah framework to support short-lived tasks on GPU-heterogeneous architectures

The Uintah computational framework is used for the parallel solution of partial differential equations on adaptive mesh refinement grids using modern supercomputers. Uintah is structured with an application layer and a separate runtime system. The Uintah runtime system is based on a distributed directed acyclic graph (DAG) of computational tasks, with a task scheduler that efficiently schedules and executes these tasks on both CPU cores and on-node accelerators. The runtime system identifies task dependencies, creates a task graph prior to an iteration based on these dependencies, prepares data for tasks, automatically generates MPI message tags, and manages data after task computation. Managing tasks for accelerators poses significant challenges compared with managing their CPU counterparts, owing to the need to support more memory regions, API call latency, memory bandwidth concerns, and the added complexity of development. These challenges are greatest when tasks compute within a few milliseconds, especially tasks that have stencil-based computations involving halo data, have little reuse of data, and/or require many computational variables. Current and emerging heterogeneous architectures necessitate addressing these challenges within Uintah. This work is not designed to improve the performance of existing tasks, but rather to reduce runtime overhead so that developers writing short-lived computational tasks can use Uintah in a heterogeneous environment. This work analyzes an initial approach for managing accelerator tasks alongside existing CPU tasks within Uintah. The principal contributions of this work are to identify and address inefficiencies that arise when mapping tasks onto the GPU, to implement new schemes that reduce runtime system overhead, to introduce new features that allow more tasks to leverage on-node accelerators, and to show the overhead reduction that results from these improvements.
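The abstract describes a runtime that identifies task dependencies and builds a task graph before running tasks on CPU cores or accelerators. As a rough illustration of that idea only, the sketch below orders tasks in a small DAG so that each task runs after the tasks it depends on. It is not Uintah's API: the function `schedule`, the task names, and the `"cpu"`/`"gpu"` labels are all hypothetical, and a real runtime would also handle data movement, MPI messaging, and concurrent execution.

```python
from collections import deque


def schedule(tasks, deps):
    """Return an execution order in which every task follows its dependencies.

    tasks: dict mapping task name -> device label ("cpu" or "gpu")
    deps:  dict mapping task name -> list of task names it depends on
    """
    # Count unmet dependencies per task and record which tasks wait on which.
    pending = {name: len(deps.get(name, [])) for name in tasks}
    dependents = {name: [] for name in tasks}
    for name, needed in deps.items():
        for dep in needed:
            dependents[dep].append(name)

    # Tasks with no unmet dependencies are ready to run immediately.
    ready = deque(sorted(name for name, count in pending.items() if count == 0))
    order = []
    while ready:
        name = ready.popleft()
        order.append((name, tasks[name]))  # a real runtime would launch the task here
        for child in dependents[name]:
            pending[child] -= 1
            if pending[child] == 0:
                ready.append(child)

    if len(order) != len(tasks):
        raise ValueError("dependency cycle detected; the task graph must be a DAG")
    return order


# Hypothetical task names and device assignments, for illustration only.
tasks = {"init": "cpu", "halo_exchange": "cpu", "stencil": "gpu", "reduce": "cpu"}
deps = {"halo_exchange": ["init"], "stencil": ["halo_exchange"], "reduce": ["stencil"]}
order = schedule(tasks, deps)
```

In this toy chain the scheduling loop does almost no work per task. The abstract's concern is the opposite case: when each task computes for only a few milliseconds, the per-task bookkeeping around it (data preparation, API calls, memory transfers) can become a significant share of the total time.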


Source journal: 高性能计算技术 (High Performance Computing Technology)

Self-citation rate: 0.00%

Number of published papers: 1121