An Empirical Study on Efficient Storage of Human Genome Data

Diksha Chaudhary, Bratati Kahali, Yogesh L. Simmhan
Published in: 2019 26th International Conference on High Performance Computing, Data and Analytics Workshop (HiPCW)
Publication date: 2019-12-01
DOI: 10.1109/HiPCW.2019.00030 (https://doi.org/10.1109/HiPCW.2019.00030)

Abstract

Next-generation sequencing (NGS) has become affordable and fast, facilitating large-scale, population-level Whole Genome Sequencing (WGS) studies. NGS and its processing pipeline generate hundreds of gigabytes of data per human subject, which can grow to petabytes for large studies such as the upcoming GenomeIndia program. At these scales, affordable and reliable storage of data becomes a challenge. Here, we propose a preliminary data management architecture for storing and querying data from the GenomeIndia project. In this initial empirical study, we focus on existing generic and domain-specific compression techniques for reducing the storage space of genome sequence data, and we compare erasure coding and replication as ways of providing reliability on commodity hardware. We report the time and space complexity of these approaches; these results will inform the future design of our architecture.
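
To make the replication versus erasure coding trade-off concrete, the following is a minimal sketch of the raw-storage arithmetic only. The specific settings (3-way replication, a 6 data + 3 parity erasure code, and a 1000 TB dataset) are illustrative assumptions and are not taken from the study; the function names are hypothetical.

```python
def replication_overhead(copies: int) -> float:
    """Raw bytes stored per byte of user data under n-way replication."""
    if copies < 1:
        raise ValueError("copies must be >= 1")
    return float(copies)


def erasure_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Raw bytes stored per byte of user data under a (k data, m parity) code."""
    if data_blocks < 1 or parity_blocks < 0:
        raise ValueError("need data_blocks >= 1 and parity_blocks >= 0")
    return (data_blocks + parity_blocks) / data_blocks


def raw_storage_tb(user_tb: float, overhead: float) -> float:
    """Raw capacity needed to hold user_tb of data at a given overhead factor."""
    return user_tb * overhead


if __name__ == "__main__":
    dataset_tb = 1000.0  # assumed dataset size, for illustration only
    # Assumed settings: 3 full copies vs. 6 data + 3 parity blocks.
    rep = raw_storage_tb(dataset_tb, replication_overhead(3))
    ec = raw_storage_tb(dataset_tb, erasure_overhead(6, 3))
    print(f"3-way replication: {rep:.0f} TB raw")
    print(f"(6, 3) erasure code: {ec:.0f} TB raw")
```

Under these assumed settings the erasure code needs half the raw capacity of replication, at the cost of encoding and reconstruction work, which is the kind of time and space trade-off the study measures empirically.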