DMA Performance Analysis and Multi-core Memory Optimization for SWIM Benchmark on the Cell Processor

2008 IEEE International Symposium on Parallel and Distributed Processing with Applications. Pub Date: 2008-12-10. DOI: 10.1109/ISPA.2008.54

Y. Dou, Lin Deng, Jinhui Xu, Yi Zheng


Citations: 13

Abstract

The Cell processor is a typical heterogeneous multi-core processor with powerful computing capability. Developing parallel applications for it, however, runs into the 'memory wall': limited local memory capacity, limited memory bandwidth shared among multiple cores, and long latency for data communication. The DMA transfer mechanism is often used to hide this latency and to make more effective use of memory bandwidth. In this paper, we start with a series of DMA experiments on the Cell processor architecture and, by means of exponential fitting, derive a unified formula for the average DMA bandwidth; the formula quantifies how the number of SPEs and the DMA block size, the two dominant factors, affect DMA bandwidth. Guided by this DMA performance formula, we apply four types of memory optimization while parallelizing the SWIM benchmark program into a multi-core version. We use a Sony PlayStation 3 (PS3) as our test bed. For the SWIM benchmark with 6 SPE cores, we obtain a speedup of over 13 times compared with a single PPE, and of 3.3 to 6.18 times compared with AMD and Intel CPUs.
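The abstract does not reproduce the fitted formula, so the following is only a minimal sketch of what an exponential fit of bandwidth against DMA block size could look like. The saturating model `b_max * (1 - exp(-size / s0))`, the function names, and the sample values are all assumptions made for illustration; the data are synthetic, not measurements from the paper, and the dependence on the number of SPEs is left out.

```python
import math


def saturating_bandwidth(block_size, b_max, s0):
    """Assumed model (not the paper's formula): bandwidth rises toward
    b_max as the DMA block size grows, with characteristic size s0."""
    return b_max * (1.0 - math.exp(-block_size / s0))


def fit_saturating_exponential(sizes, bandwidths, s0_candidates):
    """Grid-search s0; for each s0 the least-squares b_max has a closed form."""
    best = None
    for s0 in s0_candidates:
        basis = [1.0 - math.exp(-s / s0) for s in sizes]
        denom = sum(f * f for f in basis)
        b_max = sum(y * f for y, f in zip(bandwidths, basis)) / denom
        sse = sum((y - b_max * f) ** 2 for y, f in zip(bandwidths, basis))
        if best is None or sse < best[2]:
            best = (b_max, s0, sse)
    return best


# Synthetic, illustrative samples only: generated from the assumed model
# with hypothetical parameters, NOT taken from the paper's experiments.
sizes = [128, 256, 512, 1024, 2048, 4096, 8192, 16384]  # block sizes in bytes
TRUE_B_MAX, TRUE_S0 = 20.0, 1500.0                      # hypothetical values
samples = [saturating_bandwidth(s, TRUE_B_MAX, TRUE_S0) for s in sizes]

candidates = [100.0 * k for k in range(1, 101)]         # s0 grid: 100 .. 10000
b_max_fit, s0_fit, sse = fit_saturating_exponential(sizes, samples, candidates)
print(f"b_max ~ {b_max_fit:.2f}, s0 ~ {s0_fit:.0f}, SSE = {sse:.3g}")
```

With real measurements, the same routine would be run once per SPE count, and the fitted parameters could then be compared across counts to see how contention changes the curve.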


