An adaptive Non-Uniform Loop Tiling for DMA-based bulk data transfers on many-core processor

2016 IEEE 34th International Conference on Computer Design (ICCD). Pub Date: 2016-10-01. DOI: 10.1109/ICCD.2016.7753255

Keni Qiu, Yuanhui Ni, Wei-gong Zhang, Jing Wang, Xiaoqiang Wu, C. Xue, Tao Li


Citation count: 1

Abstract

Mesh Network-on-Chip (NoC) is a key fabric for interconnecting many cores with desirable scalability, reliability and interoperability. We observe that DMA-based bulk data block transfers exhibit non-negligible NoC latency due to heavy congestion. Loop tiling is an effective way to partition the data space for SPM+DMA-based data block transfers. Nevertheless, we observe that unbalanced NoC latency can degrade the effectiveness of loop tiling when it is applied uniformly. In this paper, we propose a NoC-aware Non-Uniform Loop Tiling (NULT) scheme to improve DMA performance. A NULT framework is built on the proposed model to adaptively hide DMA latency behind computation time and reduce the overall execution time. The framework first groups cores into different families according to their distance-to-data in the NoC. A heuristic method is then presented to find near-optimal tiling factors for each core family. In this way, different core families are assigned non-uniform tiling sizes. We evaluate the NULT scheme on the NIRGAM platform. Compared to the traditional uniform tiling approach, the proposed NULT technique overlaps memory access time with computation time more effectively and thus reduces the overall execution time of a loop nest.
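
The two steps named in the abstract (grouping cores into families by distance-to-data, then choosing a tiling factor per family) can be illustrated with a small sketch. This is not the paper's model or heuristic, which the abstract does not spell out. It assumes a 2D mesh with a single memory node, Manhattan hop count as the distance-to-data, a made-up linear cost model for DMA and compute time, and double buffering in the SPM. All function names, parameters and numbers below are hypothetical.

```python
from collections import defaultdict


def group_cores_by_distance(mesh_w, mesh_h, mem_xy):
    """Group core coordinates into families keyed by hop count to the data.

    Assumption: distance-to-data is the Manhattan distance to one memory node.
    """
    mx, my = mem_xy
    families = defaultdict(list)
    for y in range(mesh_h):
        for x in range(mesh_w):
            families[abs(x - mx) + abs(y - my)].append((x, y))
    return dict(families)


def pick_tile_size(hops, candidates, spm_elems, setup, per_hop, xfer, compute):
    """Return the smallest candidate tile (in elements) whose DMA time is
    covered by the compute time of one tile, assuming double buffering.

    Assumed toy cost model (not from the paper):
      dma_time(n)     = setup + per_hop * hops + xfer * n
      compute_time(n) = compute * n
    """
    # Double buffering keeps two tiles resident in the SPM at once.
    fitting = sorted(n for n in candidates if 2 * n <= spm_elems)
    if not fitting:
        raise ValueError("no candidate tile fits in the scratchpad")
    for n in fitting:
        if compute * n >= setup + per_hop * hops + xfer * n:
            return n
    # Latency cannot be fully hidden; fall back to the largest tile that fits.
    return fitting[-1]


# Hypothetical 4x4 mesh with the memory node attached at corner (0, 0).
families = group_cores_by_distance(4, 4, mem_xy=(0, 0))
tile_sizes = {
    hops: pick_tile_size(hops, [64, 128, 256, 512], spm_elems=1024,
                         setup=100, per_hop=20, xfer=1, compute=2)
    for hops in sorted(families)
}
print(tile_sizes)
```

With these made-up parameters, the families zero or one hop from the memory node get 128-element tiles and the farther families get 256-element tiles. That is the non-uniform outcome the abstract describes: cores that see longer NoC latency need more computation per tile to cover each transfer.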



