Efficient haplotype inference from pedigrees with missing data using linear systems with disjoint-set data structures.

Computational systems bioinformatics. Computational Systems Bioinformatics Conference, vol. 7, pp. 297-308. Pub Date: 2008-01-01

Xin Li, Jing Li



Abstract

We study the haplotype inference problem from pedigree data under the zero-recombination assumption, which is well supported by real data for tightly linked markers (i.e., single nucleotide polymorphisms (SNPs)) over a relatively large chromosome segment. We solve the problem in a rigorous mathematical manner by formulating genotype constraints as a linear system of inheritance variables. We then use disjoint-set structures to encode connectivity information among individuals, to detect constraints from genotypes, and to check the consistency of constraints. On a tree pedigree without missing data, our algorithm can output a general solution, as well as the total number of specific solutions, in nearly linear time O(mn·α(n)), where m is the number of loci, n is the number of individuals, and α is the inverse Ackermann function; this is a further improvement over existing algorithms. We also extend the idea to looped pedigrees and pedigrees with missing data by considering existing (partial) constraints on inheritance variables. The algorithm has been implemented in C++ and will be incorporated into our PedPhase package. Experimental results show that it can correctly identify all 0-recombinant solutions with great efficiency. Comparisons with two other popular algorithms show that the proposed algorithm achieves 10- to 10^5-fold improvements over a variety of parameter settings. The experimental study also provides empirical evidence for the complexity bounds suggested by theoretical analysis.
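As a rough illustration of how a disjoint-set structure can check the consistency of linear constraints, the Python sketch below maintains binary variables under pairwise constraints of the form x_i XOR x_j = c. This is not the authors' C++ implementation or the PedPhase code. The class name `ParityDSU`, its methods, and the restriction to two-variable constraints over GF(2) are all assumptions made for illustration. Each union or find costs amortized near-constant time, which is where an inverse-Ackermann factor typically comes from.

```python
class ParityDSU:
    """Disjoint-set forest storing each node's parity relative to its parent.

    Hypothetical helper for illustration only; not taken from the paper.
    """

    def __init__(self, n):
        self.parent = list(range(n))
        self.parity = [0] * n      # parity[v] = x_v XOR x_parent[v]
        self.rank = [0] * n
        self.components = n        # one free variable per component

    def find(self, v):
        """Return (root, x_v XOR x_root) with iterative path compression."""
        path = []
        while self.parent[v] != v:
            path.append(v)
            v = self.parent[v]
        root = v
        acc = 0
        # Start from the node closest to the root and accumulate parity.
        for u in reversed(path):
            acc ^= self.parity[u]
            self.parent[u] = root
            self.parity[u] = acc
        return root, (self.parity[path[0]] if path else 0)

    def add(self, i, j, c):
        """Add constraint x_i XOR x_j = c; return False if it contradicts earlier ones."""
        ri, pi = self.find(i)
        rj, pj = self.find(j)
        if ri == rj:
            # Both variables are already linked: just check consistency.
            return (pi ^ pj) == c
        if self.rank[ri] < self.rank[rj]:
            ri, rj = rj, ri
            pi, pj = pj, pi
        self.parent[rj] = ri
        self.parity[rj] = pi ^ pj ^ c   # equals x_rj XOR x_ri
        if self.rank[ri] == self.rank[rj]:
            self.rank[ri] += 1
        self.components -= 1
        return True

    def solution_count(self):
        """Number of assignments satisfying all accepted constraints."""
        return 2 ** self.components


dsu = ParityDSU(4)
ok1 = dsu.add(0, 1, 1)        # x0 != x1
ok2 = dsu.add(1, 2, 0)        # x1 == x2
implied = dsu.add(0, 2, 1)    # follows from the two constraints above
conflict = dsu.add(0, 2, 0)   # contradicts them
```

In this toy setting, a consistent system leaves one free variable per connected component, so the number of solutions is 2 raised to the number of components. This mirrors, in a much simplified form, the idea of reporting a general solution together with a solution count.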


