Maintaining a microbial genome & metagenome data analysis system in an academic setting

I. Chen, V. Markowitz, E. Szeto, Krishna Palaniappan, Ken Chu
{"title":"Maintaining a microbial genome & metagenome data analysis system in an academic setting","authors":"I. Chen, V. Markowitz, E. Szeto, Krishna Palaniappan, Ken Chu","doi":"10.1145/2618243.2618244","DOIUrl":null,"url":null,"abstract":"The Integrated Microbial Genomes (IMG) system integrates microbial community aggregate genomes (metagenomes) with genomes from all domains of life. IMG provides tools for analyzing and reviewing the structural and functional annotations of metagenomes and genomes in a comparative context. At the core of the IMG system is a data warehouse that contains genome and metagenome datasets provided by scientific users, as well as public bacterial, archaeal, eukaryotic, and viral genomes from the US National Center for Biotechnology Information genomic archive and a rich set of engineered, environmental and host associated metagenomes. Genomes and metagenome datasets are processed using IMG's microbial genome and metagenome sequence data processing pipelines and then are integrated into the data warehouse using IMG's data integration toolkit. Microbial genome and metagenome application specific user interfaces provide access to different subsets of IMG's data and analysis toolkits. Genome and metagenome analysis is a gene centric iterative process that involves a sequence (composition) of data exploration and comparative analysis operations, with individual operations expected to have rapid response time.\n From its first release in 2005, IMG has grown from an initial content of about 300 genomes with a total of 2 million genes, to 22,578 bacterial, archaeal, eukaryotic and viral genomes, and 4,188 metagenome samples, with about 24.6 billion genes as of May 1st, 2014. IMG's database architecture is continuously revised in order to cope with the rapid increase in the number and size of the genome and metagenome datasets, maintain good query performance, and accommodate new data types. We present in this paper IMG's new database architecture developed over the past three years in the context of limited financial, engineering and data management resources customary to academic database systems. We discuss the alternative commercial and open source database management systems we considered and experimented with and describe the hybrid architecture we devised for sustaining IMG's rapid growth.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"1 1","pages":"3:1-3:11"},"PeriodicalIF":0.0000,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2618243.2618244","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

Abstract

The Integrated Microbial Genomes (IMG) system integrates microbial community aggregate genomes (metagenomes) with genomes from all domains of life. IMG provides tools for analyzing and reviewing the structural and functional annotations of metagenomes and genomes in a comparative context. At the core of the IMG system is a data warehouse that contains genome and metagenome datasets provided by scientific users, as well as public bacterial, archaeal, eukaryotic, and viral genomes from the US National Center for Biotechnology Information genomic archive and a rich set of engineered, environmental and host associated metagenomes. Genomes and metagenome datasets are processed using IMG's microbial genome and metagenome sequence data processing pipelines and then are integrated into the data warehouse using IMG's data integration toolkit. Microbial genome and metagenome application specific user interfaces provide access to different subsets of IMG's data and analysis toolkits. Genome and metagenome analysis is a gene centric iterative process that involves a sequence (composition) of data exploration and comparative analysis operations, with individual operations expected to have rapid response time. From its first release in 2005, IMG has grown from an initial content of about 300 genomes with a total of 2 million genes, to 22,578 bacterial, archaeal, eukaryotic and viral genomes, and 4,188 metagenome samples, with about 24.6 billion genes as of May 1st, 2014. IMG's database architecture is continuously revised in order to cope with the rapid increase in the number and size of the genome and metagenome datasets, maintain good query performance, and accommodate new data types. We present in this paper IMG's new database architecture developed over the past three years in the context of limited financial, engineering and data management resources customary to academic database systems. We discuss the alternative commercial and open source database management systems we considered and experimented with and describe the hybrid architecture we devised for sustaining IMG's rapid growth.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
在学术环境中维护微生物基因组和宏基因组数据分析系统
集成微生物基因组(IMG)系统将微生物群落聚合基因组(宏基因组)与来自所有生命领域的基因组集成在一起。IMG提供了在比较背景下分析和回顾宏基因组和基因组的结构和功能注释的工具。IMG系统的核心是一个数据仓库,其中包含由科学用户提供的基因组和宏基因组数据集,以及来自美国国家生物技术信息中心基因组档案的公共细菌、古细菌、真核生物和病毒基因组,以及一套丰富的工程、环境和宿主相关宏基因组。基因组和宏基因组数据集使用IMG的微生物基因组和宏基因组序列数据处理管道进行处理,然后使用IMG的数据集成工具包集成到数据仓库中。微生物基因组和宏基因组应用程序特定的用户界面提供了对IMG数据和分析工具包的不同子集的访问。基因组和宏基因组分析是一个以基因为中心的迭代过程,涉及数据探索和比较分析操作的序列(组成),单个操作期望具有快速响应时间。自2005年首次发布以来,IMG已从最初的约300个基因组,总计200万个基因,发展到2014年5月1日,细菌、古细菌、真核生物和病毒基因组22578个,宏基因组样本4188个,约246亿个基因。IMG的数据库架构不断修改,以应对基因组和宏基因组数据集数量和规模的快速增长,保持良好的查询性能,并适应新的数据类型。在本文中,我们介绍了IMG在过去三年中在有限的财务、工程和数据管理资源背景下开发的新数据库架构,这些资源通常用于学术数据库系统。我们讨论了我们考虑和试验的其他商业和开源数据库管理系统,并描述了我们为维持IMG的快速增长而设计的混合体系结构。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Towards Co-Evolution of Data-Centric Ecosystems. Data perturbation for outlier detection ensembles SLACID - sparse linear algebra in a column-oriented in-memory database system SensorBench: benchmarking approaches to processing wireless sensor network data Efficient data management and statistics with zero-copy integration
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1