FULL-W2V: fully exploiting data reuse for W2V on GPU-accelerated systems

Thomas Randall, Tyler N. Allen, Rong Ge
{"title":"FULL-W2V: fully exploiting data reuse for W2V on GPU-accelerated systems","authors":"Thomas Randall, Tyler N. Allen, Rong Ge","doi":"10.1145/3447818.3460373","DOIUrl":null,"url":null,"abstract":"Word2Vec remains one of the highly-impactful innovations in the field of Natural Language Processing (NLP) that represents latent grammatical and syntactical information in human text with dense vectors in a low dimension. Word2Vec has high computational cost due to the algorithm’s inherent sequentiality, intensive memory accesses, and the large vocabularies it represents. While prior studies have investigated technologies to explore parallelism and improve memory system performance, they struggle to effectively gain throughput on powerful GPUs. We identify memory data access and latency as the primary bottleneck in prior works on GPUs, which prevents highly optimized kernels from attaining the architecture’s peak performance. We present a novel algorithm, FULL-W2V, which maximally exploits the opportunities for data reuse in the W2V algorithm and leverages GPU architecture and resources to reduce access to low memory levels and improve temporal locality. FULL-W2V is capable of reducing accesses to GPU global memory significantly, e.g., by more than 89%, compared to prior state-of-the-art GPU implementations, resulting in significant performance improvement that scales across successive hardware generations. Our prototype implementation achieves 2.97X speedup when ported from Nvidia Pascal P100 to Volta V100 cards, and outperforms the state-of-the-art by 5.72X on V100 cards with the same embedding quality. In-depth analysis indicates that the reduction of memory accesses through register and shared memory caching and high-throughput shared memory reduction leads to a significantly improved arithmetic intensity. FULL-W2V can potentially benefit many applications in NLP and other domains.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"6 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3447818.3460373","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Word2Vec remains one of the most impactful innovations in the field of Natural Language Processing (NLP): it represents latent grammatical and syntactical information in human text with dense vectors in a low-dimensional space. Word2Vec has a high computational cost due to the algorithm's inherent sequentiality, intensive memory accesses, and the large vocabularies it represents. While prior studies have investigated techniques to exploit parallelism and improve memory system performance, they struggle to gain throughput effectively on powerful GPUs. We identify memory data access latency as the primary bottleneck in prior works on GPUs, which prevents highly optimized kernels from attaining the architecture's peak performance. We present a novel algorithm, FULL-W2V, which maximally exploits the opportunities for data reuse in the W2V algorithm and leverages GPU architecture and resources to reduce accesses to low memory levels and improve temporal locality. FULL-W2V reduces accesses to GPU global memory significantly, e.g., by more than 89% compared to prior state-of-the-art GPU implementations, resulting in significant performance improvements that scale across successive hardware generations. Our prototype implementation achieves a 2.97X speedup when ported from Nvidia Pascal P100 to Volta V100 cards, and outperforms the state of the art by 5.72X on V100 cards with the same embedding quality. In-depth analysis indicates that reducing memory accesses through register and shared-memory caching, together with high-throughput shared-memory reductions, leads to significantly improved arithmetic intensity. FULL-W2V can potentially benefit many applications in NLP and other domains.
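The abstract credits two mechanisms for the reduced global-memory traffic: caching word vectors in registers and shared memory so they are reused across consecutive context windows and negative samples, and performing the per-update dot-product reductions in shared memory rather than global memory. The CUDA sketch below illustrates both ideas for a single center word; the kernel name (reuse_dots), the embedding dimension, and the sample count are illustrative assumptions, not the authors' implementation.

// Illustrative sketch, not the FULL-W2V source: one thread block holds a
// center word's vector in registers and reuses it across several sampled
// context vectors, reducing each dot product in shared memory. The only
// global-memory traffic per sample is the context vector itself.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int DIM      = 128;  // embedding dimension (assumed; a common W2V choice)
constexpr int THREADS  = 128;  // one thread per embedding element (assumed)
constexpr int NSAMPLES = 5;    // negative samples per center word (assumed)

__global__ void reuse_dots(const float* __restrict__ center,   // [DIM]
                           const float* __restrict__ samples,  // [NSAMPLES * DIM]
                           float* __restrict__ dots) {         // [NSAMPLES]
    __shared__ float partial[THREADS];
    const int t = threadIdx.x;

    // Load the center vector from global memory exactly once; it stays in
    // registers and is reused for every sample below.
    float c_reg[DIM / THREADS];
    for (int i = 0; i < DIM / THREADS; ++i)
        c_reg[i] = center[t + THREADS * i];

    for (int s = 0; s < NSAMPLES; ++s) {
        float acc = 0.0f;
        for (int i = 0; i < DIM / THREADS; ++i)
            acc += c_reg[i] * samples[s * DIM + t + THREADS * i];

        // Tree reduction in shared memory: no global-memory round trips.
        partial[t] = acc;
        __syncthreads();
        for (int stride = THREADS / 2; stride > 0; stride >>= 1) {
            if (t < stride) partial[t] += partial[t + stride];
            __syncthreads();
        }
        if (t == 0) dots[s] = partial[0];
        __syncthreads();  // partial[] must be free before the next sample
    }
}

int main() {
    float h_center[DIM], h_samples[NSAMPLES * DIM], h_dots[NSAMPLES];
    for (int i = 0; i < DIM; ++i)            h_center[i]  = 0.5f;
    for (int i = 0; i < NSAMPLES * DIM; ++i) h_samples[i] = 2.0f;

    float *d_center, *d_samples, *d_dots;
    cudaMalloc(&d_center,  sizeof(h_center));
    cudaMalloc(&d_samples, sizeof(h_samples));
    cudaMalloc(&d_dots,    sizeof(h_dots));
    cudaMemcpy(d_center,  h_center,  sizeof(h_center),  cudaMemcpyHostToDevice);
    cudaMemcpy(d_samples, h_samples, sizeof(h_samples), cudaMemcpyHostToDevice);

    reuse_dots<<<1, THREADS>>>(d_center, d_samples, d_dots);
    cudaMemcpy(h_dots, d_dots, sizeof(h_dots), cudaMemcpyDeviceToHost);

    for (int s = 0; s < NSAMPLES; ++s)
        printf("dot[%d] = %.1f\n", s, h_dots[s]);  // 0.5 * 2.0 * 128 = 128.0

    cudaFree(d_center); cudaFree(d_samples); cudaFree(d_dots);
    return 0;
}

In this toy setting the center vector is fetched from global memory once and reused NSAMPLES times, so global loads per dot product fall from roughly 2*DIM to DIM; generalizing that reuse across overlapping context windows and negative samples is, per the abstract, how FULL-W2V cuts global-memory accesses and raises arithmetic intensity.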