Architectural Support for Reducing Parallel Processing Overhead in an Embedded Multiprocessor

2010 IEEE/IFIP International Conference on Embedded and Ubiquitous Computing
Pub Date: 2010-12-11
DOI: 10.1109/EUC.2010.17

Jian Wang, Joar Sohl, Dake Liu


Citations: 5

Abstract



The host-multi-SIMD chip multiprocessor (CMP) architecture has proven to be efficient for high-performance signal processing: it exploits task-level parallelism through multi-core processing and data-level parallelism through SIMD processors. Unlike the cache-based memory subsystem in most general-purpose processors, this architecture uses on-chip scratchpad memory (SPM) as the processor-local data buffer and allows software to explicitly control data movement in the memory hierarchy. This SPM-based solution is more efficient for predictable signal processing in embedded systems, where data access patterns are known at design time. Predictable performance is especially important for real-time signal processing. According to Amdahl's law, the non-parallelizable part of an algorithm has a critical impact on overall performance. Implementing an algorithm on a parallel platform usually introduces control and communication overhead that is not parallelizable. This paper presents the architectural support in an embedded multiprocessor platform for minimizing this parallel processing overhead. The effectiveness of these architectural designs in boosting parallel performance is evaluated with an example implementation of 64x64 complex matrix multiplication. The results show that the parallel processing overhead is reduced from 369% to 28%.
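
The Amdahl's-law argument in the abstract can be illustrated with a small sketch. The first function is the standard form of the law. The second rests on an assumption that is not stated in the abstract: it treats the overhead percentage as non-parallelizable time relative to the ideal parallel compute time. The function names and the core count of 8 are hypothetical and chosen only for illustration; the abstract gives neither the platform's core count nor the exact definition of "overhead".

```python
def amdahl_speedup(parallel_fraction: float, n_units: int) -> float:
    """Amdahl's law: speedup when a fraction p of the work runs on n units."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if n_units < 1:
        raise ValueError("n_units must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_units)


def speedup_with_overhead(n_units: int, overhead_ratio: float) -> float:
    """Speedup of a perfectly parallel kernel burdened by serial overhead.

    ASSUMPTION (not from the paper): overhead_ratio is the serial control and
    communication time divided by the ideal parallel compute time, so the
    total run time is ideal_time * (1 + overhead_ratio).
    """
    if n_units < 1:
        raise ValueError("n_units must be at least 1")
    if overhead_ratio < 0.0:
        raise ValueError("overhead_ratio must be non-negative")
    return n_units / (1.0 + overhead_ratio)


if __name__ == "__main__":
    n = 8  # hypothetical core count, for illustration only
    for overhead in (3.69, 0.28):  # the 369% and 28% figures from the abstract
        print(f"overhead {overhead:.0%}: speedup {speedup_with_overhead(n, overhead):.2f}x")
```

Under these assumptions, 8 units would yield roughly a 1.71x speedup at 369% overhead and a 6.25x speedup at 28% overhead, which is why reducing the non-parallelizable overhead matters far more than adding processing units.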
