Variable Batched DGEMM

2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP). Pub Date: 2018-03-21. DOI: 10.1109/PDP2018.2018.00065

Pedro Valero-Lara, I. Martínez-Pérez, Sergi Mateo, R. Sirvent, Vicenç Beltran, X. Martorell, Jesús Labarta


Citations: 13

Abstract




Many scientific applications need to solve a large number of small, independent problems. Individually, these problems do not expose enough parallelism, so they must be computed as a batch. Vendors such as Intel and NVIDIA are developing their own suites of batch routines. Although most existing work focuses on batches of fixed size, in real applications we cannot assume a uniform size across all problems in a set. We explore and analyze different strategies based on the OpenMP parallel for, task and taskloop pragmas. Although these strategies are straightforward from a programmer's point of view, they differ in their impact on performance. We also analyze a new prototype provided by Intel (MKL) for batch operations (cblas_dgemm_batch). We propose a new approach called grouping, which accumulates problems into a group until a limit on memory occupancy or number of operations is reached. In this way, groups composed of different numbers of problems are distributed across cores, achieving a more balanced distribution of computational cost. This strategy is up to 6× faster than the Intel (MKL) batch routine.
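The grouping idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the limit is expressed as a number of floating-point operations, that problems are packed greedily in their original order, and that one DGEMM of shape (m, n, k) costs 2·m·n·k operations. The names `gemm_flops` and `group_problems` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# (m, n, k) describes one DGEMM: C[m x n] += A[m x k] * B[k x n].
Problem = Tuple[int, int, int]


def gemm_flops(problem: Problem) -> int:
    """Operation count of one DGEMM under the usual 2*m*n*k cost model."""
    m, n, k = problem
    return 2 * m * n * k


def group_problems(problems: List[Problem], flop_limit: int) -> List[List[int]]:
    """Pack problem indices into groups, in order, under a flop limit.

    Assumption: a group is closed as soon as adding the next problem
    would push its total cost above flop_limit. A single problem that
    is larger than the limit still gets a group of its own.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_cost = 0
    for index, problem in enumerate(problems):
        cost = gemm_flops(problem)
        if current and current_cost + cost > flop_limit:
            groups.append(current)
            current, current_cost = [], 0
        current.append(index)
        current_cost += cost
    if current:
        groups.append(current)
    return groups


# Four small problems and one larger one: the small ones are paired,
# while the large one forms a group by itself.
batch = [(2, 2, 2)] * 4 + [(4, 4, 4)]
groups = group_problems(batch, flop_limit=32)
group_costs = [sum(gemm_flops(batch[i]) for i in g) for g in groups]
```

Under these assumptions each group, rather than each individual problem, would be the unit of work handed to a core, so groups hold different numbers of problems depending on the problem sizes. A limit on memory occupancy could be handled the same way by swapping the cost function.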

