Task-DAG Support in Single-Source PHAST Library: Enabling Flexible Assignment of Tasks to CPUs and GPUs in Heterogeneous Architectures

Proceedings of the 10th International Workshop on Programming Models and Applications for Multicores and Manycores. Pub Date: 2019-02-17. DOI: 10.1145/3303084.3309496

Biagio Peccerillo, S. Bartolini


Citations: 4

Abstract

Nowadays, the majority of desktop, mobile, and embedded devices in the consumer and industrial markets are heterogeneous: they contain at least a multi-core CPU and a GPU in the same system. However, exploiting the performance and energy efficiency of these diverse processing elements does not come for free from a software point of view. Programmers need to a) code each activity with the specific approaches, libraries, and frameworks suitable for its target architecture (e.g., CPUs or GPUs) and orchestrate the resulting heterogeneous execution, and b) decide how to distribute sequential and parallel activities across the available parallel hardware resources. Current frameworks typically provide interfaces that are either low-level and target-specific or generic but not high-performance, which complicates exploring different assignments of tasks, related by DAG (directed acyclic graph) precedence relationships, to the available heterogeneous resources. Such exploration would typically require coding each task once per target architecture, because the architectures are programmed in profoundly different ways. In this work, we add support for tasks and DAGs of data-parallel tasks to the single-source PHAST library, which currently supports both multi-core CPUs and NVIDIA GPUs, so that tasks are coded in a target-agnostic fashion and their targeting to multi-core or GPU architectures is automatic and efficient. Integrating this coding approach with tasks can help postpone the choice of execution platform for each task until the testing phase, or even until runtime. Finally, we demonstrate the effects of this approach on a sample image pipeline benchmark from the computer vision domain. We compare our implementation to a SYCL implementation from a productivity point of view. We also show that various task assignments can be seamlessly explored by implementing both the PEFT (Predict Earliest Finish Time) mapping technique and an exhaustive search in the mapping space.
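To make the idea of exploring task assignments concrete, the following is a minimal Python sketch of an exhaustive search over CPU/GPU mappings for a small task DAG. It does not use the PHAST API and is not the authors' implementation. The four-stage pipeline, the per-device execution times, the fixed transfer cost, and the simple one-task-at-a-time-per-device scheduler are all assumptions made up for illustration.

```python
from itertools import product

# Hypothetical image pipeline (illustrative names, not from the paper).
# DEPS[t] lists the tasks that must finish before t can start.
# Tasks are listed in a valid topological order.
DEPS = {
    "load": [],
    "blur": ["load"],
    "edges": ["load"],
    "merge": ["blur", "edges"],
}

# Assumed execution time (ms) of each task on each device.
COST = {
    "load": {"cpu": 2.0, "gpu": 4.0},
    "blur": {"cpu": 8.0, "gpu": 2.0},
    "edges": {"cpu": 6.0, "gpu": 3.0},
    "merge": {"cpu": 3.0, "gpu": 2.0},
}

DEVICES = ("cpu", "gpu")
TRANSFER = 1.0  # assumed cost (ms) when producer and consumer devices differ


def makespan(mapping):
    """Total completion time of the DAG under a task -> device mapping.

    Each device runs one task at a time; tasks are started in the
    order in which they appear in DEPS.
    """
    device_free = {dev: 0.0 for dev in DEVICES}
    finish = {}
    for task, preds in DEPS.items():
        dev = mapping[task]
        # Inputs are ready once every predecessor has finished, plus a
        # transfer delay for predecessors that ran on the other device.
        ready = max(
            (finish[p] + (TRANSFER if mapping[p] != dev else 0.0) for p in preds),
            default=0.0,
        )
        start = max(ready, device_free[dev])
        finish[task] = start + COST[task][dev]
        device_free[dev] = finish[task]
    return max(finish.values())


def exhaustive_search():
    """Try every assignment of tasks to devices and keep the fastest."""
    tasks = list(DEPS)
    best_span, best_mapping = None, None
    for choice in product(DEVICES, repeat=len(tasks)):
        mapping = dict(zip(tasks, choice))
        span = makespan(mapping)
        if best_span is None or span < best_span:
            best_span, best_mapping = span, mapping
    return best_span, best_mapping
```

With these made-up costs, `exhaustive_search()` returns a makespan of 10.0 ms with `load` on the CPU and the other three stages on the GPU, compared with 19.0 ms for an all-CPU mapping and 11.0 ms for an all-GPU mapping. The search space grows as 2^n in the number of tasks, which is why a heuristic such as PEFT is useful for larger DAGs.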

