模板计算的隐式和显式优化

Workshop on Memory System Performance and Correctness Pub Date : 2006-10-22 DOI:10.1145/1178597.1178605

S. Kamil, K. Datta, Samuel Williams, L. Oliker, J. Shalf, K. Yelick

{"title":"模板计算的隐式和显式优化","authors":"S. Kamil, K. Datta, Samuel Williams, L. Oliker, J. Shalf, K. Yelick","doi":"10.1145/1178597.1178605","DOIUrl":null,"url":null,"abstract":"Stencil-based kernels constitute the core of many scientific applications on block-structured grids. Unfortunately, these codes achieve a low fraction of peak performance, due primarily to the disparity between processor and main memory speeds. We examine several optimizations on both the conventional cache-based memory systems of the Itanium 2, Opteron, and Power5, as well as the heterogeneous multicore design of the Cell processor. The optimizations target cache reuse across stencil sweeps, including both an implicit cache oblivious approach and a cache-aware algorithm blocked to match the cache structure. Finally, we consider stencil computations on a machine with an explicitly-managed memory hierarchy, the Cell processor. Overall, results show that a cache-aware approach is significantly faster than a cache oblivious approach and that the explicitly managed memory on Cell is more efficient: Relative to the Power5, it has almost 2x more memory bandwidth and is 3.7x faster.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"70 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2006-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"153","resultStr":"{\"title\":\"Implicit and explicit optimizations for stencil computations\",\"authors\":\"S. Kamil, K. Datta, Samuel Williams, L. Oliker, J. Shalf, K. Yelick\",\"doi\":\"10.1145/1178597.1178605\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Stencil-based kernels constitute the core of many scientific applications on block-structured grids. Unfortunately, these codes achieve a low fraction of peak performance, due primarily to the disparity between processor and main memory speeds. We examine several optimizations on both the conventional cache-based memory systems of the Itanium 2, Opteron, and Power5, as well as the heterogeneous multicore design of the Cell processor. The optimizations target cache reuse across stencil sweeps, including both an implicit cache oblivious approach and a cache-aware algorithm blocked to match the cache structure. Finally, we consider stencil computations on a machine with an explicitly-managed memory hierarchy, the Cell processor. Overall, results show that a cache-aware approach is significantly faster than a cache oblivious approach and that the explicitly managed memory on Cell is more efficient: Relative to the Power5, it has almost 2x more memory bandwidth and is 3.7x faster.\",\"PeriodicalId\":130040,\"journal\":{\"name\":\"Workshop on Memory System Performance and Correctness\",\"volume\":\"70 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2006-10-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"153\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Workshop on Memory System Performance and Correctness\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/1178597.1178605\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Workshop on Memory System Performance and Correctness","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1178597.1178605","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 153

摘要

基于模板的核构成了许多关于块结构网格的科学应用的核心。不幸的是，由于处理器和主存储器速度之间的差异，这些代码只能达到峰值性能的一小部分。我们研究了在Itanium 2、Opteron和Power5的传统基于缓存的内存系统上的几种优化，以及Cell处理器的异构多核设计。优化的目标是跨模板扫描的缓存重用，包括隐式缓存无关方法和缓存感知算法，以匹配缓存结构。最后，我们考虑在具有显式管理的内存层次结构的机器上的模板计算，即Cell处理器。总体而言，结果表明，缓存感知方法比缓存无关方法要快得多，并且Cell上显式管理的内存效率更高:相对于Power5，它的内存带宽几乎增加了2倍，速度提高了3.7倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Implicit and explicit optimizations for stencil computations

Stencil-based kernels constitute the core of many scientific applications on block-structured grids. Unfortunately, these codes achieve a low fraction of peak performance, due primarily to the disparity between processor and main memory speeds. We examine several optimizations on both the conventional cache-based memory systems of the Itanium 2, Opteron, and Power5, as well as the heterogeneous multicore design of the Cell processor. The optimizations target cache reuse across stencil sweeps, including both an implicit cache oblivious approach and a cache-aware algorithm blocked to match the cache structure. Finally, we consider stencil computations on a machine with an explicitly-managed memory hierarchy, the Cell processor. Overall, results show that a cache-aware approach is significantly faster than a cache oblivious approach and that the explicitly managed memory on Cell is more efficient: Relative to the Power5, it has almost 2x more memory bandwidth and is 3.7x faster.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Workshop on Memory System Performance and Correctness

自引率

0.00%

发文量

期刊最新文献

All-window data liveness Cache rationing for multicore Software-controlled transparent management of heterogeneous memory resources in virtualized systems Program-centric cost models for locality A study of data structures with a deep heap shape