Fast Parallel Stream Compaction for IA-Based Multi/many-core Processors

2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)
Pub Date: 2016-05-16
DOI: 10.1109/CCGrid.2016.112

Qiao Sun, Chao Yang, Changmao Wu, Leisheng Li, Fangfang Liu


Citations: 1

Abstract

Stream compaction, frequently found in a large variety of applications, serves as a general primitive that reduces an input stream to a subset containing only the wanted elements so that the follow-on computation can be done efficiently. In this paper, we propose a fast parallel stream compaction algorithm for IA-based multi-/many-core processors. Unlike previously studied algorithms, which depend heavily on a black-box parallel scan, the proposed algorithm opens the black box and manually tailors the scan so that both the workload and the memory footprint are significantly reduced. By further eliminating conditional statements and applying automatic code generation/optimization to performance-critical kernels, the proposed parallel stream compaction achieves high performance in different cases and for various data types across different IA-based multi-/many-core platforms. Experimental results on three typical IA-based processors (a quad-core Core-i7 CPU, a dual-socket 8-core Xeon CPU, and a 61-core Xeon Phi accelerator) show that the proposed implementation outperforms the reference parallel counterpart in the state-of-the-art library Thrust. In addition, we apply it in a random-forest-based data classifier to show its potential to boost the performance of real-world applications.
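To make the primitive concrete, the following is a minimal sequential sketch of a chunked compaction in the count / scan / scatter style that the abstract alludes to. It is not the authors' implementation: the function name `compact`, the `num_chunks` parameter, and the padding slot are assumptions made for illustration, and the chunk loops here run one after another rather than on separate cores.

```python
from itertools import accumulate


def compact(values, keep, num_chunks=4):
    """Illustrative chunked stream compaction (hypothetical helper).

    Keeps the elements of `values` for which `keep` is true, preserving order.
    """
    n = len(values)
    # Contiguous chunk boundaries, one chunk per hypothetical worker.
    bounds = [n * c // num_chunks for c in range(num_chunks + 1)]
    flags = [int(bool(keep(v))) for v in values]

    # Phase 1: each chunk counts how many of its elements survive.
    counts = [sum(flags[bounds[c]:bounds[c + 1]]) for c in range(num_chunks)]

    # Phase 2: an exclusive prefix sum over the per-chunk counts (not over
    # every element) gives each chunk's starting offset in the output.
    offsets = [0] + list(accumulate(counts))[:-1]
    total = sum(counts)

    # Phase 3: scatter. The write is unconditional and the cursor advances
    # by the flag (0 or 1), so a rejected element is overwritten by the next
    # write. One padding slot absorbs a trailing rejected write. Note that a
    # rejected write at the end of a chunk can land in the next chunk's first
    # slot; that is harmless in this sequential loop, but a truly parallel
    # version would need to guard against it.
    out = [None] * (total + 1)
    for c in range(num_chunks):
        pos = offsets[c]
        for i in range(bounds[c], bounds[c + 1]):
            out[pos] = values[i]
            pos += flags[i]
    return out[:total]
```

For example, `compact([3, -1, 4, -1, 5, -9, 2, 6], lambda v: v > 0)` keeps only the positive values in their original order. Scanning only the per-chunk counts is one plausible way to reduce scan work and temporary storage compared with scanning a full per-element flag array, though the paper's actual tailoring may differ.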


