DMA Performance Analysis and Multi-core Memory Optimization for SWIM Benchmark on the Cell Processor

Y. Dou, Lin Deng, Jinhui Xu, Yi Zheng
2008 IEEE International Symposium on Parallel and Distributed Processing with Applications. Published 2008-12-10. DOI: 10.1109/ISPA.2008.54 (https://doi.org/10.1109/ISPA.2008.54)
Citations: 13

Abstract

The Cell processor is a typical heterogeneous multi-core processor with powerful computing capability. Developing parallel applications for it, however, runs into the "memory wall": limited local memory capacity, limited memory bandwidth shared among multiple cores, and long latency for data communication. DMA transfers are often used to hide this latency and improve the effective use of memory bandwidth. In this paper, we start with a series of DMA experiments on the Cell processor architecture and then use mathematical analysis with exponential fitting to establish a unified formula for the average DMA bandwidth. The formula quantifies how the two main factors, the number of SPEs and the DMA block size, affect DMA bandwidth. Supported by this DMA performance formula, we apply four types of memory optimization while parallelizing the SWIM benchmark program into a multi-core version. We use the Sony PlayStation 3 (PS3) as our test bed. For the SWIM benchmark with 6 SPE cores, we obtain a speedup of over 13 times compared with a single PPE, and of 3.3 to 6.18 times compared with AMD and Intel CPUs.