Technology and Species Independent Simulation of Sequencing Data and Genomic Variants

F. Geraci, Riccardo Massidda, N. Pisanti
Published in: 2019 IEEE 19th International Conference on Bioinformatics and Bioengineering (BIBE), 2019-10-01
DOI: 10.1109/BIBE.2019.00033 (https://doi.org/10.1109/BIBE.2019.00033)
Citations: 1

Abstract

Highly accurate genotyping is essential both for genomic projects aimed at understanding the etiology of diseases and for routine screening of patients. For this reason, genotyping software packages are subject to a strict validation process that requires a large amount of sequencing data with accurate genotype information. In-vitro assessment of genotyping is a long, complex and expensive activity that also depends on the specific variation and locus, and thus it cannot realistically be used to validate in-silico genotyping algorithms. In this scenario, sequencing simulation has emerged as a practical alternative. Simulators must keep up with the continuous improvement of different sequencing technologies, producing datasets that are as indistinguishable from real ones as possible. Moreover, they must be able to mimic as many types of genomic variants as possible. In this paper we describe OmniSim, a simulator whose ultimate goal is to be suitable for all possible application scenarios. To this end, OmniSim uses an abstract model in which variations are read from a .vcf file and mapped into edit operations (insertion, deletion, substitution) on the reference genome. Technological parameters (e.g. error distributions, read length and per-base quality) are learned from real data. By combining the abstract model with the parameter learning module, OmniSim is able to output data similar in all aspects to those produced in a real sequencing experiment. The source code of OmniSim is freely available at https://gitlab.com/geraci/omnisim
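
The abstract's idea of mapping .vcf variations into edit operations on the reference genome can be illustrated with a minimal Python sketch. This is not OmniSim's actual code: the names `EditOp`, `vcf_line_to_edit` and `apply_edits` are hypothetical, and the sketch assumes simple tab-separated VCF data lines with a single ALT allele, a single haploid sequence, and non-overlapping variants.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EditOp:
    """One edit on the reference (hypothetical type, not from OmniSim)."""
    kind: str  # "substitution", "insertion" or "deletion"
    pos: int   # 0-based offset in the reference
    ref: str   # reference bases consumed by the edit
    alt: str   # bases written in their place


def vcf_line_to_edit(line: str) -> EditOp:
    """Turn one simple VCF data line (single ALT allele) into an edit."""
    fields = line.rstrip("\n").split("\t")
    # VCF columns: CHROM, POS (1-based), ID, REF, ALT, ...
    pos, ref, alt = int(fields[1]) - 1, fields[3], fields[4]
    if len(ref) == len(alt):
        kind = "substitution"
    elif len(ref) < len(alt):
        kind = "insertion"
    else:
        kind = "deletion"
    return EditOp(kind, pos, ref, alt)


def apply_edits(reference: str, edits: List[EditOp]) -> str:
    """Apply non-overlapping edits to the reference, left to right."""
    out = []
    cursor = 0  # next reference offset not yet copied to the output
    for op in sorted(edits, key=lambda e: e.pos):
        if op.pos < cursor:
            raise ValueError(f"overlapping edit at offset {op.pos}")
        if reference[op.pos:op.pos + len(op.ref)] != op.ref:
            raise ValueError(f"REF mismatch at offset {op.pos}")
        out.append(reference[cursor:op.pos])  # untouched reference bases
        out.append(op.alt)                    # edited bases
        cursor = op.pos + len(op.ref)
    out.append(reference[cursor:])
    return "".join(out)


if __name__ == "__main__":
    vcf_lines = [
        "chr1\t2\t.\tC\tT\t.\t.\t.",    # substitution C -> T
        "chr1\t4\t.\tT\tTGG\t.\t.\t.",  # insertion of GG after T
        "chr1\t6\t.\tCG\tC\t.\t.\t.",   # deletion of G after C
    ]
    edits = [vcf_line_to_edit(line) for line in vcf_lines]
    print(apply_edits("ACGTACGT", edits))  # prints ATGTGGACT
```

Classifying each variant by comparing REF and ALT lengths keeps the sketch independent of any sequencing technology or species, in the spirit of the abstract model; a real simulator would additionally need to handle multi-allelic records, genotypes, and structural variants.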