Keni Qiu, Yuanhui Ni, Wei-gong Zhang, Jing Wang, Xiaoqiang Wu, C. Xue, Tao Li
An adaptive Non-Uniform Loop Tiling for DMA-based bulk data transfers on many-core processor
2016 IEEE 34th International Conference on Computer Design (ICCD), published 2016-10-01. DOI: https://doi.org/10.1109/ICCD.2016.7753255
Citations: 1
Abstract
Mesh Network-on-Chip (NoC) is a key fabric for interconnecting many cores with desirable scalability, reliability and interoperability. We observe that DMA-based bulk data block transfers exhibit non-negligible NoC latency due to heavy congestion. Loop tiling is an effective way to partition the data space for SPM+DMA-based data block transfer. Nevertheless, we observe that unbalanced NoC latency across cores can degrade the effectiveness of uniform loop tiling. In this paper, we propose a NoC-aware Non-Uniform Loop Tiling (NULT) scheme to improve DMA performance. A NULT framework is built on the proposed model to adaptively hide DMA latency behind computation time and reduce the overall execution time. The framework first groups cores into families according to their distance-to-data in the NoC. A heuristic method then solves for near-optimal tiling factors for each core family, so that different core families are assigned non-uniform tile sizes. We evaluate the NULT scheme on the NIRGAM platform. Compared to the traditional uniform tiling approach, the proposed NULT technique overlaps memory access time with computation time more effectively and thus reduces the overall execution time of a loop nest.
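The two steps named in the abstract (grouping cores by distance-to-data, then choosing a tile size per family) can be illustrated with a minimal Python sketch. This is not the paper's model or heuristic, which the abstract does not specify. Everything here is an assumption made for illustration: the Manhattan hop count as the distance metric, the bucketing of hop counts into families, the linear latency model and its constants (`startup`, `per_hop`, `transfer_per_elem`, `compute_per_elem`), the double-buffering constraint on the SPM, and all function names.

```python
from collections import defaultdict


def hop_distance(core, data_node):
    """Manhattan hop count between two (x, y) mesh coordinates (assumed metric)."""
    return abs(core[0] - data_node[0]) + abs(core[1] - data_node[1])


def group_into_families(mesh_w, mesh_h, data_node, hops_per_family=2):
    """Bucket every core of a mesh_w x mesh_h mesh by its hop distance to the data."""
    families = defaultdict(list)
    for x in range(mesh_w):
        for y in range(mesh_h):
            d = hop_distance((x, y), data_node)
            families[d // hops_per_family].append((x, y))
    return dict(families)


def pick_tile_size(max_hops, candidates, spm_capacity,
                   startup=40, per_hop=20, transfer_per_elem=1, compute_per_elem=3):
    """Return the smallest candidate tile whose compute time covers its DMA time.

    Toy model (hypothetical constants, in arbitrary time units):
      dma_time     = startup + max_hops * per_hop + tile * transfer_per_elem
      compute_time = tile * compute_per_elem
    Two tiles must fit in the SPM at once, assuming double buffering.
    """
    fitting = [t for t in sorted(candidates) if 2 * t <= spm_capacity]
    if not fitting:
        raise ValueError("no candidate tile fits in the SPM")
    for tile in fitting:
        dma_time = startup + max_hops * per_hop + tile * transfer_per_elem
        compute_time = tile * compute_per_elem
        if compute_time >= dma_time:
            return tile
    # No candidate fully hides the latency: fall back to the largest fitting tile.
    return fitting[-1]


def plan_tiles(mesh_w, mesh_h, data_node, candidates, spm_capacity):
    """Assign one tile size per core family, based on the family's farthest core."""
    plan = {}
    for fam_id, cores in group_into_families(mesh_w, mesh_h, data_node).items():
        max_hops = max(hop_distance(c, data_node) for c in cores)
        plan[fam_id] = pick_tile_size(max_hops, candidates, spm_capacity)
    return plan


if __name__ == "__main__":
    # 4x4 mesh with the data source assumed to sit at corner (0, 0).
    print(plan_tiles(4, 4, (0, 0), [16, 32, 48, 64, 96, 128, 192, 256], 512))
```

Under this toy model, families farther from the data pay a larger fixed transfer latency and therefore receive larger tiles, which is the kind of non-uniform assignment the abstract describes. A real implementation would replace the latency model with measured or simulated NoC behavior, including congestion.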