Indexing Blocks to Reduce Space and Time Requirements for Searching Large Data Files

2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid). Pub Date: 2016-05-16. DOI: 10.1109/CCGrid.2016.18

Tzu-Hsien Wu, Hao Shyng, J. Chou, Bin Dong, Kesheng Wu


Citations: 9

Abstract

Scientific discoveries increasingly rely on the analysis of massive amounts of data generated from scientific experiments, observations, and simulations. The ability to directly access the most relevant data records, without sifting through all of them, becomes essential. While many indexing techniques have been developed to quickly locate the selected data records, the time and space required for building and storing these indexes are often too expensive to meet the demands of in situ or real-time data analysis. Existing indexing methods generally capture information about each individual data record; however, when reading a data record, the I/O system typically has to access a block or a page of data. In this work, we postulate that indexing blocks instead of individual data records could significantly reduce index size and index building time without increasing the I/O time for accessing the selected data records. Our experiments using multiple real datasets on a supercomputer show that the block index can reduce query time by a factor of 2 to 50 over other existing methods, including SciDB and FastQuery. Moreover, the size of the block index is almost negligible compared to the data size, and the index can be built at a rate that reaches the peak I/O speed.
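
The idea of indexing blocks rather than individual records can be illustrated with a small sketch. The abstract does not describe the internal structure of the authors' block index, so the code below assumes a simple per-block (min, max) summary, which is one common way to index blocks and not necessarily the paper's method. The function names `build_block_index` and `query_range` and the `block_size` parameter are hypothetical and chosen only for this illustration.

```python
from typing import List, Tuple

# Assumption: each block is summarized by its (min, max) value.
# This is an illustrative stand-in, not the structure used in the paper.


def build_block_index(values: List[float], block_size: int) -> List[Tuple[float, float]]:
    """Return one (min, max) pair per block of `block_size` consecutive records."""
    index = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        index.append((min(block), max(block)))
    return index


def query_range(
    values: List[float],
    index: List[Tuple[float, float]],
    block_size: int,
    lo: float,
    hi: float,
) -> Tuple[List[int], int]:
    """Find positions of records with lo <= value <= hi.

    Only blocks whose (min, max) range overlaps [lo, hi] are scanned.
    Returns the matching positions and the number of blocks scanned.
    """
    hits = []
    blocks_read = 0
    for b, (bmin, bmax) in enumerate(index):
        if bmax < lo or bmin > hi:
            continue  # no record in this block can match, so skip it
        blocks_read += 1
        start = b * block_size
        for offset, v in enumerate(values[start:start + block_size]):
            if lo <= v <= hi:
                hits.append(start + offset)
    return hits, blocks_read
```

In this sketch the index holds two numbers per block instead of an entry per record, so its size shrinks as the block size grows, and it can be built in a single pass over the data. A query still reads whole blocks, which matches the observation that the I/O system accesses a block or page at a time anyway.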

