Resolving multicopy duplications de novo using polyploid phasing.

Mark J Chaisson, Sudipto Mukherjee, Sreeram Kannan, Evan E Eichler
Published in: Research in computational molecular biology : ... Annual International Conference, RECOMB ... : proceedings. RECOMB (Conference : 2005- ), vol. 10229, pp. 117-133. Publication date: 2017-05-01 (Epub 2017-04-12).
DOI: https://doi.org/10.1007/978-3-319-56970-3_8
Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5553120/pdf/nihms883111.pdf

Abstract

While single-molecule sequencing has greatly improved the ability to assemble complex regions of the genome, long segmental duplications remain a challenging frontier in assembly. Segmental duplications are both gene rich and prone to large structural rearrangements, which makes resolving their sequences important in medical and evolutionary studies. Duplicated sequences that are collapsed in mammalian de novo assemblies are rarely identical; after a sequence is duplicated, it begins to acquire paralog-specific variants. In this paper, we study the problem of resolving the variation in multicopy long segmental duplications by developing and applying algorithms for polyploid phasing. We develop two algorithms. The first uses discrete matrix completion to maximize the likelihood of observing the reads given the underlying haplotypes. The second is based on correlation clustering and exploits an assumption, often satisfied in these duplications, that each paralog has a sizable number of paralog-specific variants. We develop a detailed simulation methodology and demonstrate the superior performance of the proposed algorithms on an array of simulated datasets. We measure the likelihood score as well as reconstruction accuracy, i.e., the fraction of reads that are clustered correctly. On both performance metrics, our algorithms outperform existing algorithms on more than 93% of the datasets. While discrete matrix completion performs better on likelihood score, the correlation clustering algorithm performs better on reconstruction accuracy because of the stronger regularization inherent in the algorithm. We also show that our correlation clustering algorithm can reconstruct an average of 7.0 haplotypes in 10-copy duplication datasets, whereas existing algorithms reconstruct fewer than 1 copy on average.
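
To make the correlation-clustering idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes each read is encoded as a vector over variant sites (1 = variant allele, 0 = reference allele, -1 = site not covered), scores read pairs by agreements minus disagreements, and groups reads with a simple greedy pivot scheme. The function names, the encoding, the toy reads, and the majority-label accuracy measure are all illustrative assumptions; the accuracy function is only one possible reading of "fraction of reads clustered correctly".

```python
from collections import Counter

MISSING = -1  # assumed encoding: the read does not cover this variant site


def read_similarity(a, b):
    """Agreements minus disagreements over sites that both reads cover."""
    score = 0
    for x, y in zip(a, b):
        if x == MISSING or y == MISSING:
            continue
        score += 1 if x == y else -1
    return score


def pivot_correlation_clustering(reads):
    """Greedy pivot clustering on the signed read-similarity graph.

    Pivots are taken in index order so the result is deterministic;
    the classic pivot scheme picks them at random.
    """
    unassigned = list(range(len(reads)))
    clusters = []
    while unassigned:
        pivot = unassigned[0]
        cluster = [pivot]
        rest = []
        for i in unassigned[1:]:
            # A positive score is treated as a "same paralog" edge.
            if read_similarity(reads[pivot], reads[i]) > 0:
                cluster.append(i)
            else:
                rest.append(i)
        clusters.append(cluster)
        unassigned = rest
    return clusters


def clustering_accuracy(clusters, true_labels):
    """Fraction of reads carrying the majority true label of their cluster."""
    correct = 0
    for cluster in clusters:
        counts = Counter(true_labels[i] for i in cluster)
        correct += counts.most_common(1)[0][1]
    return correct / len(true_labels)


# Toy collapsed region (hypothetical data): 6 variant sites, 3 paralogs,
# each paralog carrying 2 paralog-specific variants.
reads = [
    [1, 1, 0, 0, 0, 0],    # paralog A
    [1, 1, 0, 0, -1, -1],  # paralog A, partial coverage
    [0, 0, 1, 1, 0, 0],    # paralog B
    [-1, 0, 1, 1, 0, 0],   # paralog B, partial coverage
    [0, 0, 0, 0, 1, 1],    # paralog C
    [0, 0, 0, 1, 1, 1],    # paralog C with one simulated sequencing error
]
true_labels = ["A", "A", "B", "B", "C", "C"]

clusters = pivot_correlation_clustering(reads)
accuracy = clustering_accuracy(clusters, true_labels)
print(clusters, accuracy)
```

On this toy input the reads separate into three groups, one per paralog, because every paralog has enough paralog-specific sites for reads from different copies to disagree more often than they agree. That is the assumption the abstract highlights; with few paralog-specific variants per copy, a pairwise score like this one would no longer separate the reads.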
