通过空间填充曲线内存布局提高多核平台上结构化内存、数据密集型应用程序的性能

2015 IEEE International Parallel and Distributed Processing Symposium Workshop Pub Date : 2015-05-25 DOI:10.1109/IPDPSW.2015.71

E. W. Bethel, David Camp, D. Donofrio, Mark Howison

{"title":"通过空间填充曲线内存布局提高多核平台上结构化内存、数据密集型应用程序的性能","authors":"E. W. Bethel, David Camp, D. Donofrio, Mark Howison","doi":"10.1109/IPDPSW.2015.71","DOIUrl":null,"url":null,"abstract":"Many data-intensive algorithms -- particularly in visualization, image processing, and data analysis -- operate on structured data, that is, data organized in multidimensional arrays. While many of these algorithms are quite numerically intensive, by and large, their performance is limited by the cost of memory accesses. As we move towards the exascale regime of computing, one central research challenge is finding ways to minimize data movement through the memory hierarchy, particularly within a node in a shared-memory parallel setting. We study the effects that an alternative in-memory data layout format has in terms of runtime performance gains resulting from reducing the amount of data moved through the memory hierarchy. We focus the study on shared-memory parallel implementations of two algorithms common in visualization and analysis: a stencil-based convolution kernel, which uses a structured memory access pattern, and ray casting volume rendering, which uses a semi-structured memory access pattern. The question we study is to better understand to what degree an alternative memory layout, when used by these key algorithms, will result in improved runtime performance and memory system utilization. Our approach uses a layout based on a Z-order (Morton-order) space-filling curve data organization, and we measure and report runtime and various metrics and counters associated with memory system utilization. Our results show nearly uniform improved runtime performance and improved utilization of the memory hierarchy across varying levels of concurrency the applications we tested. This approach is complementary to other memory optimization strategies like cache blocking, but may also be more general and widely applicable to a diverse set of applications.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"49 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Improving Performance of Structured-Memory, Data-Intensive Applications on Multi-core Platforms via a Space-Filling Curve Memory Layout\",\"authors\":\"E. W. Bethel, David Camp, D. Donofrio, Mark Howison\",\"doi\":\"10.1109/IPDPSW.2015.71\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Many data-intensive algorithms -- particularly in visualization, image processing, and data analysis -- operate on structured data, that is, data organized in multidimensional arrays. While many of these algorithms are quite numerically intensive, by and large, their performance is limited by the cost of memory accesses. As we move towards the exascale regime of computing, one central research challenge is finding ways to minimize data movement through the memory hierarchy, particularly within a node in a shared-memory parallel setting. We study the effects that an alternative in-memory data layout format has in terms of runtime performance gains resulting from reducing the amount of data moved through the memory hierarchy. We focus the study on shared-memory parallel implementations of two algorithms common in visualization and analysis: a stencil-based convolution kernel, which uses a structured memory access pattern, and ray casting volume rendering, which uses a semi-structured memory access pattern. The question we study is to better understand to what degree an alternative memory layout, when used by these key algorithms, will result in improved runtime performance and memory system utilization. Our approach uses a layout based on a Z-order (Morton-order) space-filling curve data organization, and we measure and report runtime and various metrics and counters associated with memory system utilization. Our results show nearly uniform improved runtime performance and improved utilization of the memory hierarchy across varying levels of concurrency the applications we tested. This approach is complementary to other memory optimization strategies like cache blocking, but may also be more general and widely applicable to a diverse set of applications.\",\"PeriodicalId\":340697,\"journal\":{\"name\":\"2015 IEEE International Parallel and Distributed Processing Symposium Workshop\",\"volume\":\"49 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-05-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2015 IEEE International Parallel and Distributed Processing Symposium Workshop\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IPDPSW.2015.71\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPSW.2015.71","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 3

摘要

许多数据密集型算法——特别是在可视化、图像处理和数据分析方面——操作结构化数据，即组织在多维数组中的数据。虽然这些算法中的许多都是数字密集型的，但总的来说，它们的性能受到内存访问成本的限制。随着我们向百亿级计算体系迈进，一个核心的研究挑战是找到最小化内存层次结构中的数据移动的方法，特别是在共享内存并行设置中的节点内。我们研究了另一种内存数据布局格式在运行时性能提升方面的影响，因为它减少了通过内存层次结构移动的数据量。我们重点研究了可视化和分析中常见的两种算法的共享内存并行实现:基于模板的卷积核，它使用结构化内存访问模式，以及射线投射体渲染，它使用半结构化内存访问模式。我们研究的问题是更好地理解，当这些关键算法使用替代内存布局时，将在多大程度上提高运行时性能和内存系统利用率。我们的方法使用基于Z-order (Morton-order)空间填充曲线数据组织的布局，我们测量和报告运行时以及与内存系统利用率相关的各种指标和计数器。我们的结果显示，在我们测试的应用程序的不同并发级别上，运行时性能和内存层次结构的利用率几乎一致地得到了改进。这种方法是对其他内存优化策略(如缓存阻塞)的补充，但也可能更通用，更广泛地适用于各种应用程序。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Improving Performance of Structured-Memory, Data-Intensive Applications on Multi-core Platforms via a Space-Filling Curve Memory Layout

Many data-intensive algorithms -- particularly in visualization, image processing, and data analysis -- operate on structured data, that is, data organized in multidimensional arrays. While many of these algorithms are quite numerically intensive, by and large, their performance is limited by the cost of memory accesses. As we move towards the exascale regime of computing, one central research challenge is finding ways to minimize data movement through the memory hierarchy, particularly within a node in a shared-memory parallel setting. We study the effects that an alternative in-memory data layout format has in terms of runtime performance gains resulting from reducing the amount of data moved through the memory hierarchy. We focus the study on shared-memory parallel implementations of two algorithms common in visualization and analysis: a stencil-based convolution kernel, which uses a structured memory access pattern, and ray casting volume rendering, which uses a semi-structured memory access pattern. The question we study is to better understand to what degree an alternative memory layout, when used by these key algorithms, will result in improved runtime performance and memory system utilization. Our approach uses a layout based on a Z-order (Morton-order) space-filling curve data organization, and we measure and report runtime and various metrics and counters associated with memory system utilization. Our results show nearly uniform improved runtime performance and improved utilization of the memory hierarchy across varying levels of concurrency the applications we tested. This approach is complementary to other memory optimization strategies like cache blocking, but may also be more general and widely applicable to a diverse set of applications.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2015 IEEE International Parallel and Distributed Processing Symposium Workshop

自引率

0.00%

发文量

期刊最新文献

Accelerating Large-Scale Single-Source Shortest Path on FPGA Relocation-Aware Floorplanning for Partially-Reconfigurable FPGA-Based Systems iWAPT Introduction and Committees Computing the Pseudo-Inverse of a Graph's Laplacian Using GPUs Optimizing Defensive Investments in Energy-Based Cyber-Physical Systems