Fast CSV Loading Using GPUs and RDMA for In-Memory Data Processing

Alexander Kumaigorodski, Clemens Lutz, V. Markl
{"title":"Fast CSV Loading Using GPUs and RDMA for In-Memory Data Processing","authors":"Alexander Kumaigorodski, Clemens Lutz, V. Markl","doi":"10.18420/btw2021-01","DOIUrl":null,"url":null,"abstract":": Comma-separated values (CSV) is a widely-used format for data exchange. Due to the format’s prevalence, virtually all industrial-strength database systems and stream processing frameworks support importing CSV input. However, loading CSV input close to the speed of I/O hardware is challenging. Modern I/O devices such as InfiniBand NICs and NVMe SSDs are capable of sustaining high transfer rates of 100 Gbit/s and higher. At the same time, CSV parsing performance is limited by the complex control flows that its semi-structured and text-based layout incurs. In this paper, we propose to speed-up loading CSV input using GPUs. We devise a new parsing approach that streamlines the control flow while correctly handling context-sensitive CSV features such as quotes. By offloading I/O and parsing to the GPU, our approach enables databases to load CSVs at high throughput from main memory with NVLink 2.0, as well as directly from the network with RDMA. In our evaluation, we show that GPUs parse real-world datasets at up to 60 GB/s, thereby saturating high-bandwidth I/O devices.","PeriodicalId":421643,"journal":{"name":"Datenbanksysteme für Business, Technologie und Web","volume":"146 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Datenbanksysteme für Business, Technologie und Web","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18420/btw2021-01","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

: Comma-separated values (CSV) is a widely-used format for data exchange. Due to the format’s prevalence, virtually all industrial-strength database systems and stream processing frameworks support importing CSV input. However, loading CSV input close to the speed of I/O hardware is challenging. Modern I/O devices such as InfiniBand NICs and NVMe SSDs are capable of sustaining high transfer rates of 100 Gbit/s and higher. At the same time, CSV parsing performance is limited by the complex control flows that its semi-structured and text-based layout incurs. In this paper, we propose to speed-up loading CSV input using GPUs. We devise a new parsing approach that streamlines the control flow while correctly handling context-sensitive CSV features such as quotes. By offloading I/O and parsing to the GPU, our approach enables databases to load CSVs at high throughput from main memory with NVLink 2.0, as well as directly from the network with RDMA. In our evaluation, we show that GPUs parse real-world datasets at up to 60 GB/s, thereby saturating high-bandwidth I/O devices.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
快速CSV加载使用gpu和RDMA内存中的数据处理
: CSV (Comma-separated values)是一种广泛使用的数据交换格式。由于该格式的流行,几乎所有工业强度的数据库系统和流处理框架都支持导入CSV输入。但是,加载接近I/O硬件速度的CSV输入是一项挑战。InfiniBand网卡和NVMe ssd等现代I/O设备能够维持100 Gbit/s甚至更高的传输速率。同时,CSV解析性能受到其半结构化和基于文本的布局所导致的复杂控制流的限制。在本文中,我们提出使用gpu加速加载CSV输入。我们设计了一种新的解析方法,可以在正确处理上下文敏感的CSV特性(如引号)的同时简化控制流程。通过将I/O和解析卸载到GPU,我们的方法使数据库能够使用NVLink 2.0从主内存以高吞吐量加载csv,也可以直接从RDMA网络加载csv。在我们的评估中,我们表明gpu以高达60 GB/s的速度解析真实世界的数据集,从而使高带宽I/O设备饱和。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
SportsTables: A new Corpus for Semantic Type Detection Accelerating Large Table Scan using Processing-In-Memory Technology The InsightsNet Climate Change Corpus (ICCC) On the State of German (Abstractive) Text Summarization The Easiest Way of Turning your Relational Database into a Blockchain - and the Cost of Doing So
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1