Improving cache locality with blocked array layouts

Evangelia Athanasaki, N. Koziris
{"title":"Improving cache locality with blocked array layouts","authors":"Evangelia Athanasaki, N. Koziris","doi":"10.1109/EMPDP.2004.1271460","DOIUrl":null,"url":null,"abstract":"Minimizing cache misses is one of the most important factors to reduce average latency for memory accesses. Tiled codes modify the instruction stream to exploit cache locality for array accesses. Here, we further reduce cache misses, restructuring the memory layout of multidimensional arrays, that are accessed by tiled instruction code. In our method, array elements are stored in a blocked way, exactly as they are swept by the tiled instruction stream. We present a straightforward way to easily translate multidimensional indexing of arrays into their blocked memory layout using simple binary-mask operations. Indices for such array layouts are easily calculated based on the algebra of dilated integers, similarly to morton-order indexing. Actual experimental results, using matrix multiplication and LU-decomposition on various size arrays, illustrate that execution time is greatly improved when combining tiled code with tiled array layouts and binary mask-based index translation functions. Simulations using the Simplescalar tool, verify that enhanced performance is due to the considerable reduction of total cache misses.","PeriodicalId":105726,"journal":{"name":"12th Euromicro Conference on Parallel, Distributed and Network-Based Processing, 2004. Proceedings.","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2004-03-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"12th Euromicro Conference on Parallel, Distributed and Network-Based Processing, 2004. Proceedings.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/EMPDP.2004.1271460","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

Minimizing cache misses is one of the most important factors to reduce average latency for memory accesses. Tiled codes modify the instruction stream to exploit cache locality for array accesses. Here, we further reduce cache misses, restructuring the memory layout of multidimensional arrays, that are accessed by tiled instruction code. In our method, array elements are stored in a blocked way, exactly as they are swept by the tiled instruction stream. We present a straightforward way to easily translate multidimensional indexing of arrays into their blocked memory layout using simple binary-mask operations. Indices for such array layouts are easily calculated based on the algebra of dilated integers, similarly to morton-order indexing. Actual experimental results, using matrix multiplication and LU-decomposition on various size arrays, illustrate that execution time is greatly improved when combining tiled code with tiled array layouts and binary mask-based index translation functions. Simulations using the Simplescalar tool, verify that enhanced performance is due to the considerable reduction of total cache misses.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
通过阻塞数组布局改进缓存局部性
最小化缓存丢失是减少内存访问平均延迟的最重要因素之一。平铺代码修改指令流以利用缓存局部性进行数组访问。在这里,我们进一步减少缓存缺失,重构由平铺指令代码访问的多维数组的内存布局。在我们的方法中,数组元素以阻塞的方式存储,就像它们被平铺指令流扫描一样。我们提供了一种简单的方法,可以使用简单的二进制掩码操作,轻松地将数组的多维索引转换为它们的阻塞内存布局。这种数组布局的索引很容易基于扩展整数的代数计算,类似于morton-order索引。在不同大小的数组上使用矩阵乘法和lu分解的实际实验结果表明,将平铺代码与平铺数组布局和基于二进制掩码的索引转换函数相结合可以大大提高执行时间。使用Simplescalar工具进行模拟,验证性能的增强是由于大量减少了总缓存丢失。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Fast dependence analysis in a multimedia vectorizing compiler Empirical characterization of the latency of long asynchronous pipelines with data-dependent module delays Parallelization and comparison of 3D iterative reconstruction algorithms Efficient monitoring to detect wireless channel failures for MPI programs Exporting processing power of home embedded devices to global computing applications
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1