Resolving multicopy duplications de novo using polyploid phasing.

Research in computational molecular biology : ... Annual International Conference, RECOMB ... : proceedings. RECOMB (Conference : 2005- ) Pub Date : 2017-05-01 Epub Date: 2017-04-12 DOI:10.1007/978-3-319-56970-3_8

Mark J Chaisson, Sudipto Mukherjee, Sreeram Kannan, Evan E Eichler

Volume 10229, pages 117-133. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5553120/pdf/nihms883111.pdf


Abstract

Although single-molecule sequencing has greatly improved the ability to assemble complex regions of the genome, long segmental duplications remain a challenging frontier in assembly. Segmental duplications are both gene rich and prone to large structural rearrangements, which makes resolving their sequences important for medical and evolutionary studies. Duplicated sequences that are collapsed in mammalian de novo assemblies are rarely identical; after a sequence is duplicated, it begins to acquire paralog-specific variants. In this paper, we study the problem of resolving the variation in multicopy long segmental duplications by developing and applying algorithms for polyploid phasing. We develop two algorithms. The first uses discrete matrix completion to maximize the likelihood of observing the reads given the underlying haplotypes. The second is based on correlation clustering and exploits an assumption, often satisfied in these duplications, that each paralog has a sizable number of paralog-specific variants. We develop a detailed simulation methodology and demonstrate the superior performance of the proposed algorithms on an array of simulated datasets. We measure the likelihood score as well as reconstruction accuracy, i.e., the fraction of reads that are clustered correctly. On both performance metrics, our algorithms outperform existing algorithms on more than 93% of the datasets. While discrete matrix completion performs better on likelihood score, the correlation-clustering algorithm performs better on reconstruction accuracy because of the stronger regularization inherent in the algorithm. We also show that our correlation-clustering algorithm reconstructs on average 7.0 haplotypes in 10-copy duplication datasets, whereas existing algorithms reconstruct fewer than one copy on average.
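To make the clustering idea and the accuracy metric concrete, the toy sketch below groups reads by their paralog-specific variants and then scores the grouping as the fraction of reads clustered correctly. It is not the authors' implementation: the read matrix, the pairwise scoring rule in `pair_score`, and the greedy pivot heuristic (a common simple approach to correlation clustering) are all assumptions chosen for illustration.

```python
from itertools import permutations

# Hypothetical read-by-variant matrix (illustrative data, not from the paper):
# +1 = read carries the variant, -1 = read carries the reference base,
#  0 = read does not cover the site.
READS = [
    [+1, +1, -1, -1, 0, 0],   # sampled from copy 0
    [+1, +1, -1, 0, -1, 0],   # copy 0
    [-1, -1, +1, +1, -1, 0],  # copy 1
    [0, -1, +1, +1, -1, -1],  # copy 1
    [-1, 0, -1, -1, +1, +1],  # copy 2
    [0, -1, -1, -1, +1, +1],  # copy 2
]
TRUE_COPY = [0, 0, 1, 1, 2, 2]


def pair_score(r, s):
    """Assumed scoring rule: +1 per shared variant, -1 per disagreement.

    Sites uncovered by either read, and sites where both reads carry the
    reference base, contribute nothing.
    """
    score = 0
    for a, b in zip(r, s):
        if a == 0 or b == 0:
            continue
        if a == 1 and b == 1:
            score += 1
        elif a != b:
            score -= 1
    return score


def pivot_cluster(reads):
    """Greedy pivot heuristic for correlation clustering.

    Take the first unassigned read as a pivot and pull in every unassigned
    read whose score against the pivot is positive; repeat until done.
    """
    labels = [None] * len(reads)
    next_label = 0
    for pivot in range(len(reads)):
        if labels[pivot] is not None:
            continue
        labels[pivot] = next_label
        for other in range(pivot + 1, len(reads)):
            if labels[other] is None and pair_score(reads[pivot], reads[other]) > 0:
                labels[other] = next_label
        next_label += 1
    return labels


def reconstruction_accuracy(predicted, truth):
    """Fraction of reads clustered correctly under the best label matching.

    Brute force over label permutations, so only suitable for a few copies.
    """
    k = max(max(predicted), max(truth)) + 1
    best = 0
    for perm in permutations(range(k)):
        hits = sum(1 for p, t in zip(predicted, truth) if perm[p] == t)
        best = max(best, hits)
    return best / len(truth)


if __name__ == "__main__":
    labels = pivot_cluster(READS)
    print(labels, reconstruction_accuracy(labels, TRUE_COPY))
```

In this toy case every copy has two private variants, which mirrors the assumption that each paralog has a sizable number of paralog-specific variants; with fewer private variants or noisier reads, a single pivot comparison would be much less reliable.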
