WorkQ: A many-core producer/consumer execution model applied to PGAS computations

2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS)
Pub Date: 2014-12-01
DOI: 10.1109/PADSW.2014.7097863

David Ozog, A. Malony, J. Hammond, P. Balaji


Citations: 0

Abstract


Partitioned global address space (PGAS) applications, such as the Tensor Contraction Engine (TCE) in NWChem, often apply a one-process-per-core mapping in which each process iterates through the following work-processing cycle: (1) determine a work-item dynamically, (2) get data via one-sided operations on remote blocks, (3) perform computation on the data locally, (4) put (or accumulate) resultant data into an appropriate remote location, and (5) repeat the cycle. However, this simple flow of execution does not effectively hide communication latency costs despite the opportunities for making asynchronous progress. Utilizing nonblocking communication calls is not sufficient unless care is taken to efficiently manage a responsive queue of outstanding communication requests. This paper presents a new runtime model and its library implementation for managing tunable “work queues” in PGAS applications. Our runtime execution model, called WorkQ, assigns some number of on-node “producer” processes to primarily do communication (steps 1, 2, 4, and 5) and the other “consumer” processes to do computation (step 3); but processes can switch roles dynamically for the sake of performance. Load balance, synchronization, and overlap of communication and computation are facilitated by a tunable nodewise FIFO message queue protocol. Our WorkQ library implementation enables an MPI+X hybrid programming model where the X comprises SysV message queues and the user's choice of SysV, POSIX, and MPI shared memory. We develop a simplified software mini-application that mimics the performance behavior of the TCE at arbitrary scale, and we show that the WorkQ engine outperforms the original model by about a factor of 2. We also show performance improvement in the TCE coupled cluster module of NWChem.
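The producer/consumer split described in the abstract can be illustrated with a minimal sketch. This is not the WorkQ library: it assumes Python threads and `queue.Queue` as stand-ins for on-node processes and SysV message queues, a dictionary as a stand-in for remote PGAS blocks, and squaring as a placeholder computation. The name `run_workq` and all of its parameters are hypothetical, and dynamic role switching is not modeled.

```python
import itertools
import queue
import threading

SENTINEL = None  # tells a consumer that no more work will arrive


def run_workq(num_items, num_producers=1, num_consumers=3, depth=4):
    """Hypothetical sketch: producers do steps 1, 2, 4; consumers do step 3."""
    remote_in = {i: float(i) for i in range(num_items)}  # stand-in for remote blocks
    remote_out = {}                                      # stand-in for remote results
    work_q = queue.Queue(maxsize=depth)  # bounded FIFO; depth is the tunable knob
    done_q = queue.Queue()               # unbounded, so consumers never block on it
    counter = itertools.count()
    counter_lock = threading.Lock()
    out_lock = threading.Lock()

    def flush_results():
        # Step 4: write finished results back to the "remote" location.
        while True:
            try:
                item, value = done_q.get_nowait()
            except queue.Empty:
                return
            with out_lock:
                remote_out[item] = value

    def producer():
        while True:
            with counter_lock:
                item = next(counter)      # step 1: claim a work item dynamically
            if item >= num_items:
                break
            data = remote_in[item]        # step 2: "get" the input data
            work_q.put((item, data))      # blocks while the queue is full
            flush_results()               # opportunistically perform step 4

    def consumer():
        while True:
            msg = work_q.get()
            if msg is SENTINEL:
                break
            item, data = msg
            done_q.put((item, data * data))  # step 3: placeholder computation

    producers = [threading.Thread(target=producer) for _ in range(num_producers)]
    consumers = [threading.Thread(target=consumer) for _ in range(num_consumers)]
    for t in producers + consumers:
        t.start()
    for t in producers:
        t.join()
    for _ in consumers:
        work_q.put(SENTINEL)  # FIFO order guarantees these follow all real work
    for t in consumers:
        t.join()
    flush_results()           # write back whatever finished after the last flush
    return remote_out
```

In this sketch the bounded queue depth is what couples the two roles: a full queue stalls the producers, so communication runs ahead of computation only by a limited number of work items.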
