高性能GPU事务性内存通过急切冲突检测

X. Ren, Mieszko Lis
{"title":"高性能GPU事务性内存通过急切冲突检测","authors":"X. Ren, Mieszko Lis","doi":"10.1109/HPCA.2018.00029","DOIUrl":null,"url":null,"abstract":"GPUs transactional memory (TM) proposals to date have relied on lazy, value-based conflict detection, assuming that GPUs can amortize the latency by executing other warps. In practice, however, concurrency must be throttled to a few warps per core to avoid high abort rates, and TM performance has remained far below that of fine-grained locks. We trace this to the latency cost of validating transactions: two round trips across the crossbar required for most commits and aborts. With limited concurrency, the warp scheduler cannot amortize this, and leaves the core idle most of the time. In this paper, we show that value-based validation does not scale to high thread counts, and eager conflict detection becomes more efficient as the number of threads grows. We leverage this insight to propose GETM, a GPU TM with eager conflict detection. GETM relies on a novel distributed logical clock scheme to implement eager conflict detection without the need for cache coherence or signature broadcasts. GETM is up to 2.1 times faster than the state-of-the art prior work WarpTM (gmean 1.2 times), with 3.6 times lower silicon area overheads and 2.2 times lower power overheads.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"127 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"High-Performance GPU Transactional Memory via Eager Conflict Detection\",\"authors\":\"X. Ren, Mieszko Lis\",\"doi\":\"10.1109/HPCA.2018.00029\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"GPUs transactional memory (TM) proposals to date have relied on lazy, value-based conflict detection, assuming that GPUs can amortize the latency by executing other warps. In practice, however, concurrency must be throttled to a few warps per core to avoid high abort rates, and TM performance has remained far below that of fine-grained locks. We trace this to the latency cost of validating transactions: two round trips across the crossbar required for most commits and aborts. With limited concurrency, the warp scheduler cannot amortize this, and leaves the core idle most of the time. In this paper, we show that value-based validation does not scale to high thread counts, and eager conflict detection becomes more efficient as the number of threads grows. We leverage this insight to propose GETM, a GPU TM with eager conflict detection. GETM relies on a novel distributed logical clock scheme to implement eager conflict detection without the need for cache coherence or signature broadcasts. GETM is up to 2.1 times faster than the state-of-the art prior work WarpTM (gmean 1.2 times), with 3.6 times lower silicon area overheads and 2.2 times lower power overheads.\",\"PeriodicalId\":154694,\"journal\":{\"name\":\"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)\",\"volume\":\"127 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-02-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/HPCA.2018.00029\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCA.2018.00029","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

摘要

迄今为止,gpu事务性内存(TM)提案依赖于惰性的、基于值的冲突检测,假设gpu可以通过执行其他扭曲来分摊延迟。然而,在实践中,必须将并发性限制在每个核心的几次扭曲,以避免高中断率,并且TM的性能仍然远远低于细粒度锁。我们将其追溯到验证事务的延迟成本:大多数提交和终止都需要在交叉栏上进行两次往返。由于并发性有限,warp调度器无法分摊这一点,因此大多数时间内核都处于空闲状态。在本文中,我们证明了基于值的验证不能扩展到高线程数,并且随着线程数的增加,渴望冲突检测变得更有效。我们利用这一见解提出了GETM,一种具有渴望冲突检测的GPU TM。GETM依赖于一种新颖的分布式逻辑时钟方案来实现急切冲突检测,而不需要缓存一致性或签名广播。GETM比目前最先进的WarpTM(平均1.2倍)快2.1倍,硅面积开销降低3.6倍,功耗开销降低2.2倍。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
High-Performance GPU Transactional Memory via Eager Conflict Detection
GPUs transactional memory (TM) proposals to date have relied on lazy, value-based conflict detection, assuming that GPUs can amortize the latency by executing other warps. In practice, however, concurrency must be throttled to a few warps per core to avoid high abort rates, and TM performance has remained far below that of fine-grained locks. We trace this to the latency cost of validating transactions: two round trips across the crossbar required for most commits and aborts. With limited concurrency, the warp scheduler cannot amortize this, and leaves the core idle most of the time. In this paper, we show that value-based validation does not scale to high thread counts, and eager conflict detection becomes more efficient as the number of threads grows. We leverage this insight to propose GETM, a GPU TM with eager conflict detection. GETM relies on a novel distributed logical clock scheme to implement eager conflict detection without the need for cache coherence or signature broadcasts. GETM is up to 2.1 times faster than the state-of-the art prior work WarpTM (gmean 1.2 times), with 3.6 times lower silicon area overheads and 2.2 times lower power overheads.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Record-Replay Architecture as a General Security Framework LATTE-CC: Latency Tolerance Aware Adaptive Cache Compression Management for Energy Efficient GPUs Secure DIMM: Moving ORAM Primitives Closer to Memory OuterSPACE: An Outer Product Based Sparse Matrix Multiplication Accelerator WIR: Warp Instruction Reuse to Minimize Repeated Computations in GPUs
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1