Deep learning for assembly of haplotypes and viral quasispecies from short and long sequencing reads

Ziqi Ke, H. Vikalo
Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Published 2022-08-07. DOI: 10.1145/3535508.3545524 (https://doi.org/10.1145/3535508.3545524)

Abstract

Information about genetic variations in either individual genomes or viral populations provides insight into the genetic signatures of diseases and suggests directions for medical and pharmaceutical research. State-of-the-art sequencing platforms generate massive numbers of reads, with lengths varying from one technology to another, that provide the data needed to reconstruct haplotypes and viral quasispecies. On the one hand, high-throughput platforms provide enormous numbers of highly accurate but relatively short reads; because such reads cannot bridge long genetic distances, reconstruction from them is challenging. On the other hand, the latest generation of sequencing technologies generates much longer reads, but those reads suffer from sequencing errors at a higher rate than short reads. This motivates the search for reconstruction methods capable of leveraging both the high accuracy of short reads and the phase-resolving power of long reads. We present a deep learning framework that relies on convolutional auto-encoders with a clustering layer to reconstruct individual haplotypes or viral populations from hybrid data sources. First, an auto-encoder for haplotype assembly / viral population reconstruction from short reads is pre-trained separately from another one that uses long reads for the same task. The pre-trained models are then retrained simultaneously to enable decision fusion. Results on realistic synthetic as well as experimental data demonstrate that the proposed framework outperforms state-of-the-art techniques for haplotype assembly and viral quasispecies reconstruction, and achieves significantly higher accuracy on those tasks than methods that use only one type of read. Code is available at https://github.com/WuLoli/HybSeq.
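The abstract does not give the network architecture or the fusion rule, so the following is only a minimal sketch of the generic step that clustering-based reconstruction tends to share: once each read has been assigned to a cluster, a haplotype (or viral strain) can be read off as a per-position consensus, and the assignment can be scored by counting read-to-haplotype mismatches (the minimum error correction, or MEC, score). The function names, the `-` gap symbol, the toy reads and the hand-written labels are all assumptions made for illustration. They are not taken from the HybSeq code, and the cluster labels here stand in for whatever the trained auto-encoders would output.

```python
from collections import Counter

GAP = "-"  # assumed symbol for a variant position a read does not cover


def consensus_haplotypes(reads, labels, n_clusters):
    """Build one consensus string per cluster by per-position majority vote."""
    n_pos = len(reads[0])
    haplotypes = []
    for k in range(n_clusters):
        members = [r for r, lab in zip(reads, labels) if lab == k]
        hap = []
        for j in range(n_pos):
            # Count only the reads in this cluster that cover position j.
            counts = Counter(r[j] for r in members if r[j] != GAP)
            hap.append(counts.most_common(1)[0][0] if counts else GAP)
        haplotypes.append("".join(hap))
    return haplotypes


def mec_score(reads, labels, haplotypes):
    """Count mismatches between each read and its assigned haplotype."""
    return sum(
        1
        for r, lab in zip(reads, labels)
        for a, b in zip(r, haplotypes[lab])
        if a != GAP and b != GAP and a != b
    )


# Toy data: short reads cover few positions; long reads span all of them,
# and the last long read carries one sequencing error (G at position 4).
reads = [
    "AC----",  # short read, cluster 0
    "--GT--",  # short read, cluster 0
    "ACGTAC",  # long read, cluster 0
    "TG----",  # short read, cluster 1
    "----CA",  # short read, cluster 1
    "---ACA",  # short read, cluster 1
    "TGCAGA",  # long read with one error, cluster 1
]
labels = [0, 0, 0, 1, 1, 1, 1]  # hypothetical cluster assignments

haplotypes = consensus_haplotypes(reads, labels, n_clusters=2)
score = mec_score(reads, labels, haplotypes)
print(haplotypes, score)
```

In this toy case the short reads outvote the erroneous base in the long read, while the long reads link positions that no single short read spans. This is the complementarity the abstract appeals to, shown here with fixed labels rather than a learned model.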