D2MA: Accelerating coarse-grained data transfer for GPUs

D. Jamshidi, M. Samadi, S. Mahlke
{"title":"D2MA: Accelerating coarse-grained data transfer for GPUs","authors":"D. Jamshidi, M. Samadi, S. Mahlke","doi":"10.1145/2628071.2628072","DOIUrl":null,"url":null,"abstract":"To achieve high performance on many-core architectures like GPUs, it is crucial to efficiently utilize the available memory bandwidth. Currently, it is common to use fast, on-chip scratchpad memories, like the shared memory available on GPUs' shader cores, to buffer data for computation. This buffering, however, has some sources of inefficiency that hinder it from most efficiently utilizing the available memory resources. These issues stem from shader resources being used for repeated, regular address calculations, a need to shuffle data multiple times between a physically unified on-chip memory, and forcing all threads to synchronize to ensure RAW consistency based on the speed of the slowest threads. To address these inefficiencies, we propose DataParallel DMA, or D2MA. D2MA is a reimagination of traditional DMA that addresses the challenges of extending DMA to thousands of concurrently executing threads. D2MA decouples address generation from the shader's computational resources, provides a more direct and efficient path for data in global memory to travel into the shared memory, and introduces a novel dynamic synchronization scheme that is transparent to the programmer. 
These advancements allow D2MA to achieve speedups as high as 2.29×, and reduces the average time to buffer data by 81% on average.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"140 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"16","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2628071.2628072","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 16

Abstract

To achieve high performance on many-core architectures like GPUs, it is crucial to efficiently utilize the available memory bandwidth. Currently, it is common to use fast, on-chip scratchpad memories, like the shared memory available on GPUs' shader cores, to buffer data for computation. This buffering, however, has several sources of inefficiency that prevent it from making the best use of the available memory resources. These issues stem from shader resources being consumed by repeated, regular address calculations, from the need to shuffle data multiple times through a physically unified on-chip memory, and from forcing all threads to synchronize, at the pace of the slowest thread, to ensure RAW consistency. To address these inefficiencies, we propose Data-Parallel DMA, or D2MA. D2MA is a reimagination of traditional DMA that addresses the challenges of extending DMA to thousands of concurrently executing threads. D2MA decouples address generation from the shader's computational resources, provides a more direct and efficient path for data in global memory to travel into shared memory, and introduces a novel dynamic synchronization scheme that is transparent to the programmer. These advancements allow D2MA to achieve speedups as high as 2.29× and to reduce the time spent buffering data by 81% on average.
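To make the three overheads concrete, the sketch below (illustrative only, not code from the paper) shows the conventional CUDA shared-memory buffering pattern that D2MA targets: every thread computes its own global address on the shader's ALUs, stages data through registers into shared memory, and the entire thread block barriers at the speed of its slowest thread before computing. The kernel name, tile size, and the scale-by-`k` computation are invented for illustration.

```cuda
#include <cuda_runtime.h>

#define TILE 256  // illustrative tile size, one element per thread

// Conventional software buffering: the pattern whose address arithmetic,
// register staging, and block-wide barriers D2MA is designed to eliminate.
__global__ void scale_tiles(const float* __restrict__ in,
                            float* __restrict__ out, int n, float k) {
    __shared__ float buf[TILE];
    for (int base = blockIdx.x * TILE; base < n; base += gridDim.x * TILE) {
        int idx = base + threadIdx.x;        // (1) per-thread address calculation
                                             //     on the shader's compute units
        if (idx < n)
            buf[threadIdx.x] = in[idx];      // (2) global -> register -> shared:
                                             //     data shuffled through the
                                             //     unified on-chip memory
        __syncthreads();                     // (3) RAW barrier: all threads wait
                                             //     for the slowest loader
        if (idx < n)
            out[idx] = k * buf[threadIdx.x]; // compute out of shared memory
        __syncthreads();                     // keep tile live until all have read
    }
}
```

Under D2MA, as the abstract describes, steps (1) and (2) would be offloaded from the shader to a DMA-like engine with its own direct global-to-shared path, and the block-wide barrier in (3) would be replaced by a dynamic synchronization scheme that is transparent to the programmer.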