An adaptive Non-Uniform Loop Tiling for DMA-based bulk data transfers on many-core processor

Keni Qiu, Yuanhui Ni, Wei-gong Zhang, Jing Wang, Xiaoqiang Wu, C. Xue, Tao Li
Published in: 2016 IEEE 34th International Conference on Computer Design (ICCD), 2016-10-01
DOI: 10.1109/ICCD.2016.7753255 (https://doi.org/10.1109/ICCD.2016.7753255)
Citations: 1

Abstract

Mesh Network-on-Chip (NoC) is a key fabric for interconnecting many cores, offering desirable scalability, reliability, and interoperability. We observe that DMA-based bulk data block transfers incur non-negligible NoC latency due to heavy congestion. Loop tiling is an effective way to partition the data space for SPM+DMA-based data block transfers. Nevertheless, we observe that unbalanced NoC latency can degrade the effectiveness of uniform loop tiling. In this paper, we propose a NoC-aware Non-Uniform Loop Tiling (NULT) scheme to improve DMA performance. A NULT framework is built on the proposed model to adaptively hide DMA latency behind computation time and reduce the overall execution time. The framework first groups cores into families according to their distance-to-data in the NoC. A heuristic method then finds near-optimal tiling factors for each core family, so that different core families are assigned non-uniform tile sizes. We evaluate the NULT scheme on the NIRGAM platform. Compared with the traditional uniform tiling approach, the proposed NULT technique overlaps memory access time with computation time more effectively and thus reduces the overall execution time of a loop nest.
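The two steps named in the abstract (grouping cores by distance-to-data, then choosing a tile size per family) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the Manhattan hop count as the distance measure, the linear latency model in `dma_time`, the per-element cost in `compute_time`, and the "smallest tile whose compute time covers the next DMA transfer" rule in `pick_tile_size`. The paper's actual model and heuristic are not given in the abstract.

```python
from collections import defaultdict


def group_cores_by_distance(mesh_w, mesh_h, data_node):
    """Group the cores of a mesh_w x mesh_h mesh into families.

    Families are keyed by the Manhattan hop count from each core to
    data_node (hypothetical stand-in for "distance-to-data").
    """
    families = defaultdict(list)
    dx, dy = data_node
    for y in range(mesh_h):
        for x in range(mesh_w):
            hops = abs(x - dx) + abs(y - dy)
            families[hops].append((x, y))
    return dict(families)


def dma_time(tile_elems, hops, setup=50.0, per_hop=20.0, per_elem=1.0):
    """Hypothetical linear DMA latency model, in cycles."""
    return setup + per_hop * hops + per_elem * tile_elems


def compute_time(tile_elems, per_elem=4.0):
    """Hypothetical compute cost for one tile, in cycles."""
    return per_elem * tile_elems


def pick_tile_size(hops, candidates, spm_capacity):
    """Pick a tile size for a core family that is `hops` away from the data.

    Returns the smallest candidate that fits double-buffered in the SPM
    and whose compute time is at least the DMA time of the next tile,
    so the transfer can be hidden behind computation. If no candidate
    hides the transfer, falls back to the largest tile that fits.
    """
    fitting = [t for t in sorted(candidates) if 2 * t <= spm_capacity]
    for t in fitting:
        if compute_time(t) >= dma_time(t, hops):
            return t
    return fitting[-1] if fitting else None


# Example: 4x4 mesh with the data source attached at corner (0, 0).
families = group_cores_by_distance(4, 4, (0, 0))
candidates = [16, 32, 64, 128, 256]
tile_per_family = {
    hops: pick_tile_size(hops, candidates, spm_capacity=1024)
    for hops in sorted(families)
}
```

Under these assumed costs, families closer to the data end up with smaller tiles than distant ones, which is the non-uniform assignment the abstract describes in qualitative terms.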