Efficient parallelization of the Discrete Wavelet Transform algorithm using memory-oblivious optimizations

2015 25th International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS) | Pub Date: 2015-12-07 | DOI: 10.1109/PATMOS.2015.7347583

A. Keliris, Vasilis Dimitsas, O. Kremmyda, D. Gizopoulos, M. Maniatakos


Citations: 1

Abstract

As the rate of single-thread CPU performance improvement per generation has diminished due to slower transistor-speed scaling and energy-related issues, researchers and industry have shifted their interest towards multi-core and many-core architectures for improving performance. Comparisons between optimized applications for parallel architectures have been quantified many times in the literature, but contradictory results have been reported, mainly due to biased methods of evaluating and comparing these architectures. In this paper, we present memory-oblivious optimizations of the widely used Discrete Wavelet Transform (DWT), and provide detailed comparisons of the algorithm on Intel and AMD multi-core CPUs, Nvidia many-core GPUs, and Intel's Xeon Phi many-core coprocessor. Our results indicate that, compared to their respective non-optimized single-thread implementations, memory-oblivious optimization delivers maximum speedups of 17.9× to 197.2× across the architectures examined. Furthermore, the presented CPU and GPU memory-oblivious implementations are 2.6× and 1.3× faster, respectively, than the fastest DWT implementations currently available in the literature. No comparison to the state of the art can be made for the Xeon Phi because, to the best of our knowledge, this is the first study that optimizes the DWT for this new architecture.
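
For readers unfamiliar with the transform itself, the following is a minimal sketch of what a single DWT level computes. It is an illustration only and is not the authors' implementation: the abstract does not state which wavelet filter the paper uses, so the Haar wavelet is assumed here for simplicity, and the function names `haar_dwt_1d` and `haar_idwt_1d` are hypothetical. The loop is a plain, non-optimized single-thread form; none of the paper's memory-oblivious or parallel optimizations are shown.

```python
import math


def haar_dwt_1d(signal):
    """One level of the orthonormal Haar DWT (assumed wavelet, for illustration).

    Returns (approx, detail): scaled pairwise sums and differences.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt_1d(approx, detail):
    """Inverse of haar_dwt_1d: rebuilds the original samples pair by pair."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) * s)  # first sample of the pair
        out.append((a - d) * s)  # second sample of the pair
    return out


if __name__ == "__main__":
    x = [4.0, 2.0, 5.0, 5.0]
    approx, detail = haar_dwt_1d(x)
    print(approx, detail)
    print(haar_idwt_1d(approx, detail))
```

Each output pair depends only on two neighbouring input samples, so the pairs can be computed independently of one another. That independence is what makes the transform a natural candidate for the multi-core and many-core platforms the abstract compares.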
