Hanna Sigeman, Ina Satokangas, Matthieu de Lamarre, Patrick Krapf, Pierre Nouhaud, Riddhi Deshmukh, Heikki Helanterä, Michel Chapuisat, Jonna Kulmuni, Lumi Viljakainen
Some of the most striking examples of phenotypic variation within species are controlled by supergenes. However, most research on supergenes has focused on their emergence and long-term maintenance, leaving the later stages of their life cycle largely unexplored. Specifically, what happens to a derived supergene haplotype when the trait it controls reaches fixation? Here we answer this question using the ancient supergene system of Formica ants, where single-queen (monogynous) colonies typically carry only the ancestral haplotype M while the derived haplotype P is exclusive to multiple-queen (polygynous) colonies. Through comparative population genomics of 264 individuals from all seven European wood ant species, we found that the P haplotype was present in only one of the three obligately polygynous species (Formica polyctena). In the two others (Formica aquilonia and Formica paralugubris), the P haplotype was completely missing except for duplicated P-specific paralogs of two genes, Zasp52 and TTLL2, with Zasp52 being directly involved in wing muscle development. We hypothesize that these genes play a direct role in polygyny and contribute to differences in body size and/or dispersal behavior between monogynous and polygynous queens. A complete lack of P/P genotypes among the 261 workers suggests strong selection against such genotypes. While our analyses did not reveal evidence of increased mutation load on the P haplotype, it is possible that this skew in genotype distributions is driven by a few loci with strong fitness effects. We propose that selection to escape P-associated fitness costs underlies the loss of this haplotype in obligately polygynous wood ants.
{"title":"The loss of a supergene in obligately polygynous Formica wood ant species.","authors":"Hanna Sigeman, Ina Satokangas, Matthieu de Lamarre, Patrick Krapf, Pierre Nouhaud, Riddhi Deshmukh, Heikki Helanterä, Michel Chapuisat, Jonna Kulmuni, Lumi Viljakainen","doi":"10.1093/molbev/msaf320","DOIUrl":"10.1093/molbev/msaf320","url":null,"abstract":"<p><p>Some of the most striking examples of phenotypic variation within species are controlled by supergenes. However, most research on supergenes has focused on their emergence and long-term maintenance, leaving the later stages of their life cycle largely unexplored. Specifically, what happens to a derived supergene haplotype when the trait it controls reaches fixation? Here we answer this question using the ancient supergene system of Formica ants, where (monogynous) single-queen colonies typically carry only the ancestral haplotype M while the derived haplotype P is exclusive to (polygynous) colonies with multiple queens. Through comparative population genomics of 264 individuals from all seven European wood ant species, we found that the P haplotype was present in only 1/3 obligately polygynous species (Formica polyctena). In the two others (Formica aquilonia and Formica paralugubris), the P haplotype was completely missing except for duplicated P-specific paralogs of two genes, Zasp52 and TTLL2, with Zasp52 being directly involved in wing muscle development. We hypothesize that these genes play a direct role in polygyny and contribute to differences in body size and/or dispersal behavior between monogynous and polygynous queens. A complete lack of P/P genotypes among the 261 workers suggests strong selection against such genotypes. While our analyses did not reveal evidence of increased mutation load on the P, it is possible that this skew in genotype distributions is driven by a few loci with strong fitness effects. 
We propose that selection to escape P-associated fitness costs underlies the loss of this haplotype in obligately polygynous wood ants.</p>","PeriodicalId":18730,"journal":{"name":"Molecular biology and evolution","volume":" ","pages":""},"PeriodicalIF":5.3,"publicationDate":"2025-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12728502/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145768533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Riya Nilkant, Lisa Y Mesrop, Samuel Lobo, Onur Sakarya, Joan E Shea, Scott Shell, Soojin V Yi, Kenneth S Kosik
Some genes encoding proteins within the co-evolved pre- and postsynaptic compartments are present in genomes long preceding the origination of the synapse within the animal kingdom. DLG4 encodes PSD-95, one of the most abundant synaptic proteins. PSD-95 is a MAGUK family member that shares a conserved domain structure composed of one or multiple PDZ domains, a Src homology 3 (SH3) domain, and a guanylate kinase (GK) domain. Here, we trace the phylogeny of the tri-PDZ domains in DLG4 to their deep ancestral origin in Filozoa, which includes animals and their nearest unicellular relatives. PDZ domain architecture appears to be a strong organizing feature of this gene lineage, which originated with a single ancestral PDZ3-like domain in Capsaspora owczarzaki from which PDZ1 and PDZ2 were derived. The strong conservation of individual PDZ domain identities was captured by Evolutionary Scale Modeling (ESM2) across the boundary to the animal kingdom, corroborating distinct clades formed by the divergence of PDZ1, PDZ2, and PDZ3 in the phylogeny. CRIPT, a PDZ3 ligand, is present in all Filozoa genomes studied here. AlphaFold2 Multimer demonstrates conserved binding function; however, conserved binding does not completely depend on either sequence motifs or hydrophobicity profiles. Rather, the most conserved feature is hydrogen bonds at the 0 and -2 positions of the ligand, an ancient foundational innovation for PDZ3 ligand interaction. Hydrogen bonds may loosen the sequence requirements for binding to allow a more extensive search space for protein-protein interactions that enhance fitness before the mutations that secure those interactions occur.
{"title":"Evolution of the Tri-PDZ Domain in PSD95 (DLG-4 Gene).","authors":"Riya Nilkant, Lisa Y Mesrop, Samuel Lobo, Onur Sakarya, Joan E Shea, Scott Shell, Soojin V Yi, Kenneth S Kosik","doi":"10.1093/molbev/msaf309","DOIUrl":"10.1093/molbev/msaf309","url":null,"abstract":"<p><p>Some genes encoding proteins within the co-evolved pre- and postsynaptic compartments are present in genomes long preceding the origination of the synapse within the animal kingdom. DLG4, gene encoding PSD-95, is one of the most abundant synaptic proteins. It is a MAGUK family member that shares a conserved domain structure comprised of one or multiple PDZ domains, a Src homology 3 (SH3), and a guanylate kinase (GK) domain. Here, we construct the phylogeny of the tri-PDZ domains in DLG4 to its deep ancestral origin in Filozoa, which includes animals and their nearest unicellular relatives. PDZ domain architecture appears to be a strong organizing feature of this gene lineage that originated with a single ancestral PDZ3-like domain in Capsaspora owczarzaki from which PDZ1 and PDZ2 were derived. The strong conservation of individual PDZ domain identities was captured by Evolutionary Scale Modeling (ESM2) across the boundary to the animal kingdom, corroborating distinct clades formed by the divergence of PDZ1, PDZ2, and PDZ3 in the phylogeny. CRIPT, PDZ3 ligand, is present in all Filozoa genomes studied here. AlphaFold2 Multimer demonstrates conserved binding function; however, conserved binding does not completely depend on either sequence motifs or hydrophobicity profiles. Rather, the most conserved feature is hydrogen bonds at the 0 and -2 positions of the ligand as an ancient foundational innovation for PDZ3 ligand interaction. 
Hydrogen bonds may loosen the sequence requirements for binding to allow a more extensive search space for protein-protein interactions that enhance fitness before the mutations that secure those interactions occur.</p>","PeriodicalId":18730,"journal":{"name":"Molecular biology and evolution","volume":"42 12","pages":""},"PeriodicalIF":5.3,"publicationDate":"2025-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12709283/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145768420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jibom Jung, Siliang Song, Myeong-Yeon Kim, Haena Kwak, Benny K K Chan, Sun-Shin Cha, Ui Wook Hwang, Joong-Ki Park
Parasitic lifestyles often impose profound evolutionary pressures, affecting molecular evolution through both adaptive and non-adaptive mechanisms. Among barnacles (subclass Cirripedia), the obligate parasitic Rhizocephala differ markedly from their filter-feeding thoracican relatives in morphology, ecology, and life history. However, how the shift to parasitism has shaped mitochondrial genome evolution within Cirripedia remains unclear. Here, we present the first comprehensive comparative analysis of mitochondrial genomes between parasitic and non-parasitic barnacles, including three newly sequenced and one unpublished species of parasitic Rhizocephala, a clade whose mitochondrial genomes had not been characterized until now. Phylogenomic and molecular evolutionary analyses reveal that Rhizocephala species exhibit extremely long branches, likely attributable to the clade-specific tempo (high substitution rate) and mode (selection pressure) of mtDNA sequence evolution associated with their parasitic lifestyle. A two-cluster molecular clock test reveals significantly elevated substitution rates across rhizocephalans, consistent with reduced effective population sizes (Ne) linked to their opportunistic, host-dependent life cycles. We also detect signatures of positive selection in protein-coding genes encoding key components of the electron transport chain complexes III and IV. Structural modeling highlights amino acid substitutions at functionally critical sites for electron transfer and proton pumping, suggesting adaptive modifications to mitochondrial bioenergetics under hypoxic conditions within host tissues. Together, our findings underscore that both non-adaptive (genetic drift, relaxed selection) and adaptive (positive selection) processes have driven the rapid sequence divergence of mitochondrial genomes in parasitic Rhizocephala. Further experimental study is needed to elucidate how mitochondrial and nuclear-encoded subunits of oxidative phosphorylation coevolve in this specialized parasitic group.
{"title":"Accelerated Mitochondrial Genome Evolution in Parasitic Barnacles Driven by Adaptive and Non-adaptive Responses.","authors":"Jibom Jung, Siliang Song, Myeong-Yeon Kim, Haena Kwak, Benny K K Chan, Sun-Shin Cha, Ui Wook Hwang, Joong-Ki Park","doi":"10.1093/molbev/msaf303","DOIUrl":"10.1093/molbev/msaf303","url":null,"abstract":"<p><p>Parasitic lifestyles often impose profound evolutionary pressures, affecting molecular evolution through both adaptive and non-adaptive mechanisms. Among barnacles (subclass Cirripedia), the obligate parasitic Rhizocephala differ markedly from their filter-feeding thoracican relatives in morphology, ecology, and life history. However, how the shift to parasitism has shaped mitochondrial genome evolution within Cirripedia remains unclear. Here, we present the first comprehensive comparative analysis of mitochondrial genomes between parasitic and non-parasitic barnacles, including three newly sequenced and one unpublished species of parasitic Rhizocephala, a clade whose mitochondrial genomes had not been characterized until now. Phylogenomic and molecular evolutionary analyses reveal that Rhizocephala species exhibit extremely long branches likely attributed to the clade-specific tempo (high substitution rate) and mode (selection pressure) of mtDNA sequence evolution associated with their parasitic lifestyle. A two-cluster molecular clock test reveals significantly elevated substitution rates across rhizocephalans, consistent with reduced effective population sizes (Ne) linked to their opportunistic, host-dependent life cycles. We also detect signatures of positive selection in protein-coding genes encoding key components of the electron transport chain complexes III and IV. Structural modeling highlights amino acid substitutions at functionally critical sites for electron transfer and proton pumping, suggesting adaptive modifications to mitochondrial bioenergetics under hypoxic conditions within host tissues. 
Together, our findings underscore that both non-adaptive (genetic drift, relaxed selection) and adaptive (positive selection) processes have driven the rapid sequence divergence of mitochondrial genomes in parasitic Rhizocephala. Further experimental study is needed to elucidate how mitochondrial and nuclear-encoded subunits of oxidative phosphorylation coevolve in this specialized parasitic group.</p>","PeriodicalId":18730,"journal":{"name":"Molecular biology and evolution","volume":" ","pages":""},"PeriodicalIF":5.3,"publicationDate":"2025-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12696376/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145588021","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Correction to: UPrimer: A Clade-Specific Primer Design Program Based on Nested-PCR Strategy and Its Applications in Amplicon Capture Phylogenomics.","authors":"","doi":"10.1093/molbev/msaf317","DOIUrl":"10.1093/molbev/msaf317","url":null,"abstract":"","PeriodicalId":18730,"journal":{"name":"Molecular biology and evolution","volume":"42 12","pages":""},"PeriodicalIF":5.3,"publicationDate":"2025-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12687589/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145708823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ultraconserved elements are segments of DNA that are identical or nearly identical in distantly related species. Finding 100% identity over long evolutionary times is unexpected, but pioneering research in human-mouse pairwise alignment uncovered something even more puzzling: these elements are not as rare as previously suspected. Furthermore, their sizes follow a power-law distribution, a feature that cannot be explained by standard models of genome evolution, in which conservation is expected to decay exponentially. Although power-law behavior has been reported and investigated in a wide variety of biological and physical contexts, from cell division to protein family evolution, why it appears in the size distribution of ultraconserved elements remains elusive. To address this question, I propose a model of DNA sequence evolution by mutations of arbitrary length based on a classical integro-differential equation that arises in various applications in biology. The model captures the ultraconserved size distribution observed in pairwise alignments between human and 40 other vertebrates, encompassing more than 400 million years of evolution, from chimpanzee to zebrafish. I also show that the model can be used to predict other important aspects of genome evolution, such as indel rates and conservation in functional classes.
{"title":"Modeling the Evolution of Ultraconserved Elements by Indels.","authors":"Priscila Biller","doi":"10.1093/molbev/msaf299","DOIUrl":"10.1093/molbev/msaf299","url":null,"abstract":"<p><p>Ultraconserved elements are segments of DNA that are identical or nearly identical in distantly related species. Finding 100% identity over long evolutionary times is unexpected, but pioneering research in human-mouse pairwise alignment uncovered something even more puzzling: these elements are not as rare as previously suspected. Furthermore, their sizes are distributed as a power-law, a feature that cannot be explained by standard models of genome evolution where conservation is expected to decay exponentially. Despite the power-law behavior having been reported and investigated in a wide variety of biological and physical contexts, from cell-division to protein family evolution, why it appears in the size distribution of ultraconserved elements remains elusive. To address this question, I propose a model of DNA sequence evolution by mutations of arbitrary length based on a classical integro-differential equation that arises in various applications in biology. The model captures the ultraconserved size distribution observed in pairwise alignments between human and 40 other vertebrates, encompassing more than 400 million years of evolution, from chimpanzee to zebrafish. 
I also show that the model can be used to predict other important aspects of genome evolution, such as indel rates and conservation in functional classes.</p>","PeriodicalId":18730,"journal":{"name":"Molecular biology and evolution","volume":" ","pages":""},"PeriodicalIF":5.3,"publicationDate":"2025-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12673672/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145573867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Merlin Szymanski, Johann Visagie, Frédéric Romagné, Matthias Meyer, Janet Kelso
Ancient DNA extracted from the sediments of archaeological sites (sedaDNA) can provide fine-grained information about the composition of past ecosystems and human site use, even in the absence of visible remains. However, the growing amount of available sequencing data and the nature of the data obtained from archaeological sediments pose several computational challenges; among these, the rapid and accurate taxonomic classification of sequences. While alignment-based taxonomic classifiers remain the standard in sedaDNA analysis pipelines, they are too computationally expensive for the processing of large numbers of sedaDNA sequences. In contrast, alignment-free methods offer fast classification but suffer from higher false-positive rates. To address these limits, we developed quicksand, an open-source Nextflow pipeline designed for rapid and accurate taxonomic classification of mammalian mitochondrial DNA in sedaDNA samples. quicksand combines fast alignment-free classification using KrakenUniq with post-classification mapping, filtering, and ancient DNA authentication. Based on simulations and reanalyses of published datasets, we demonstrate that quicksand achieves accuracy and sensitivity comparable to or better than existing methods, while significantly reducing runtime. quicksand offers an easy workflow for large-scale screening of sedaDNA samples for archaeological research and is freely available at https://github.com/mpieva/quicksand.
{"title":"Quick Analysis of Sedimentary Ancient DNA Using quicksand.","authors":"Merlin Szymanski, Johann Visagie, Frédéric Romagné, Matthias Meyer, Janet Kelso","doi":"10.1093/molbev/msaf305","DOIUrl":"10.1093/molbev/msaf305","url":null,"abstract":"<p><p>Ancient DNA extracted from the sediments of archaeological sites (sedaDNA) can provide fine-grained information about the composition of past ecosystems and human site use, even in the absence of visible remains. However, the growing amount of available sequencing data and the nature of the data obtained from archaeological sediments pose several computational challenges; among these, the rapid and accurate taxonomic classification of sequences. While alignment-based taxonomic classifiers remain the standard in sedaDNA analysis pipelines, they are too computationally expensive for the processing of large numbers of sedaDNA sequences. In contrast, alignment-free methods offer fast classification but suffer from higher false-positive rates. To address these limits, we developed quicksand, an open-source Nextflow pipeline designed for rapid and accurate taxonomic classification of mammalian mitochondrial DNA in sedaDNA samples. quicksand combines fast alignment-free classification using KrakenUniq with post-classification mapping, filtering, and ancient DNA authentication. Based on simulations and reanalyses of published datasets, we demonstrate that quicksand achieves accuracy and sensitivity comparable to or better than existing methods, while significantly reducing runtime. 
quicksand offers an easy workflow for large-scale screening of sedaDNA samples for archaeological research and is freely available at https://github.com/mpieva/quicksand.</p>","PeriodicalId":18730,"journal":{"name":"Molecular biology and evolution","volume":" ","pages":""},"PeriodicalIF":5.3,"publicationDate":"2025-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12684969/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145596772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cade Mirchandani, Erik Enbody, Timothy B Sackton, Russ Corbett-Detig
The increasing scale of population genomic datasets presents computational challenges in estimating summary statistics such as nucleotide diversity (π) and divergence (dxy). Accurate estimates of diversity require knowledge of missing data, and existing tools require all-site VCFs. However, generating these files is computationally expensive for large datasets. Here, we introduce Callable Loci And More (clam), a tool that leverages callable loci (determined from depth information) to estimate population genetic statistics using a variant-only VCF. This approach offers improvements in storage footprint and computational performance compared to contemporary methods. We validate clam's accuracy using simulated data, demonstrating that it produces estimates of π, dxy, and fixation index (FST) identical to those from all-site VCF approaches. We then benchmark clam using a large muskox dataset and demonstrate that it produces accurate estimates of π while substantially reducing runtime requirements compared to current best-practice methods. clam provides an efficient and scalable alternative for population genomic analyses, facilitating the study of increasingly large and diverse datasets. clam is available as a standalone program and integrated into snpArcher for efficient reproducible population genomic analysis.
{"title":"Efficient Estimation of Nucleotide Diversity and Divergence Using Callable Loci (and More).","authors":"Cade Mirchandani, Erik Enbody, Timothy B Sackton, Russ Corbett-Detig","doi":"10.1093/molbev/msaf282","DOIUrl":"10.1093/molbev/msaf282","url":null,"abstract":"<p><p>The increasing scale of population genomic datasets presents computational challenges in estimating summary statistics such as nucleotide diversity (π) and divergence (dxy). Accurate estimates of diversity require knowledge of missing data, and existing tools require all-site VCFs. However, generating these files is computationally expensive for large datasets. Here, we introduce Callable Loci And More (clam), a tool that leverages callable loci-determined from depth information-to estimate population genetic statistics using a variant-only VCF. This approach offers improvements in storage footprint and computational performance compared to contemporary methods. We validate clam's accuracy using simulated data, demonstrating that it produces estimates of π, dxy, and fixation index (FST) identical to those from all-site VCF approaches. We then benchmark clam using a large muskox dataset and demonstrate that it produces accurate estimates of π while substantially reducing runtime requirements compared to current best-practice methods. clam provides an efficient and scalable alternative for population genomic analyses, facilitating the study of increasingly large and diverse datasets. 
clam is available as a standalone program and integrated into snpArcher for efficient reproducible population genomic analysis.</p>","PeriodicalId":18730,"journal":{"name":"Molecular biology and evolution","volume":" ","pages":""},"PeriodicalIF":5.3,"publicationDate":"2025-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12697346/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145588078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
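The idea in the clam abstract above, that π and dxy can be computed from variant sites alone once the number of callable sites is known, can be illustrated with a minimal sketch. This is not clam's implementation: the function names, the biallelic-site simplification, and the use of a single callable-site count shared by all samples are assumptions made for illustration only.

```python
# Illustrative sketch only; NOT the clam implementation.
# Assumptions: biallelic sites, and one callable-site count that applies
# to every sample (a real tool would likely track callability per sample).

def pairwise_diff_within(alt_count, n_chrom):
    """Expected pairwise differences at one biallelic site within a population."""
    if n_chrom < 2:
        return 0.0
    ref_count = n_chrom - alt_count
    return 2.0 * alt_count * ref_count / (n_chrom * (n_chrom - 1))

def pairwise_diff_between(alt1, n1, alt2, n2):
    """Expected differences at one biallelic site between two populations."""
    return (alt1 * (n2 - alt2) + (n1 - alt1) * alt2) / (n1 * n2)

def nucleotide_diversity(variant_sites, n_callable_sites):
    """pi: summed within-population differences over variant sites,
    divided by all callable sites (invariant callable sites contribute 0)."""
    total = sum(pairwise_diff_within(alt, n) for alt, n in variant_sites)
    return total / n_callable_sites

def divergence(variant_sites_pair, n_callable_sites):
    """dxy: summed between-population differences over variant sites,
    divided by all callable sites."""
    total = sum(pairwise_diff_between(a1, n1, a2, n2)
                for a1, n1, a2, n2 in variant_sites_pair)
    return total / n_callable_sites

# Hypothetical data: (alt allele count, chromosomes sampled) per variant site.
sites = [(1, 4), (2, 4)]
pi = nucleotide_diversity(sites, n_callable_sites=100)
dxy = divergence([(1, 4, 3, 4)], n_callable_sites=100)
```

The point of the sketch is the denominator: invariant sites never need to appear in the VCF, because only their count enters the estimate.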
Phylogenomic data are indispensable for establishing reliable relationships needed to build a robust Tree of Life. The superalignment approach concatenates hundreds or thousands of genomic segments, providing a straightforward, computationally efficient, and effective means of inferring phylogenies. However, the standard bootstrap method can produce overly confident support for incorrect inferences based on superalignments. It fails to account for the heterogeneity in phylogenetic signals across the data, which is caused by incomplete lineage sorting (ILS), data errors, and other biological processes. To detect such erroneous inferences, researchers need to produce and deliberate on the concordance of inferences derived from many complex and computationally demanding analyses that require knowledge of data partitions. This study demonstrates that analyzing phylogenomic subsamples with bootstrap upsampling overcomes the overconfidence drawback of the superalignment approach. We found that bootstrapping multiple small, randomly selected site subsets can detect the presence of phylogeny variation signals across the dataset, similar to that detected using data partitions. We present the Net Bootstrap Support (NBS) approach that accounts for this phylogenetic variation in the estimates of bootstrap confidence. NBS values showed comparable performance to multispecies coalescent analyses in the presence of ILS and surpassed it for datasets simulated with gene tree estimation errors. NBS analyses of phylogenomic data from rodents, fungi, and carnivorous plants corroborated the performance observed in simulated datasets and even mitigated overconfidence resulting from some data errors. NBS calculations are computationally efficient, with low memory consumption and high computational time savings, making the NBS approach well suited for big data molecular phylogenetics on both desktops and high-performance computing systems.
{"title":"Robust and Efficient Confidence Limits for Phylogenomic Inference of Organismal Relationships.","authors":"Sudip Sharma, Sudhir Kumar","doi":"10.1093/molbev/msaf296","DOIUrl":"10.1093/molbev/msaf296","url":null,"abstract":"<p><p>Phylogenomic data are indispensable for establishing reliable relationships needed to build a robust Tree of Life. The superalignment approach concatenates hundreds or thousands of genomic segments, providing a straightforward, computationally efficient, and effective means of inferring phylogenies. However, the standard bootstrap method can produce overly confident support for incorrect inferences based on superalignments. It fails to account for the heterogeneity in phylogenetic signals across the data, which is caused by incomplete lineage sorting (ILS), data errors, and other biological processes. To detect such erroneous inferences, researchers need to produce and deliberate on the concordance of inferences derived from many complex and computationally demanding analyses that require knowledge of data partitions. This study demonstrates that analyzing phylogenomic subsamples with bootstrap upsampling overcomes the overconfidence drawback of the superalignment approach. We found that bootstrapping multiple small, randomly selected site subsets can detect the presence of phylogeny variation signals across the dataset, similar to that detected using data partitions. We present the Net Bootstrap Support (NBS) approach that accounts for this phylogenetic variation in the estimates of bootstrap confidence. NBS values showed comparable performance to multispecies coalescent analyses in the presence of ILS and surpassed it for datasets simulated with gene tree estimation errors. NBS analyses of phylogenomic data from rodents, fungi, and carnivorous plants corroborated the performance observed in simulated datasets and even mitigated overconfidence resulting from some data errors. 
NBS calculations are computationally efficient, with low memory consumption and high computational time savings, making the NBS approach well suited for big data molecular phylogenetics on both desktops and high-performance computing systems.</p>","PeriodicalId":18730,"journal":{"name":"Molecular biology and evolution","volume":" ","pages":""},"PeriodicalIF":5.3,"publicationDate":"2025-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12665395/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145541390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qingbei Cheng, Muhammad Saqib Sohail, Matthew R McKay
Estimating selection from genetic time-series data is fundamental to understanding evolutionary dynamics. Accurate selection inference is confounded by multiple noise sources, including limited sampling of populations and genetic drift. To characterize how these uncertainties collectively affect estimator performance, we analyze a mathematically tractable selection coefficient estimator derived under the marginal path likelihood (MPL) framework. We identify a parameter, the integrated mutant allele variance, as a key quantity determining estimator precision. Our analysis reveals that variance integration mitigates sampling and genetic drift errors at different rates, with drift typically becoming the dominant source of error in longer trajectories. The increased robustness of MPL-based estimation to sampling is surprising, since MPL is derived from a model that neglects this effect. Our findings offer insights into how incorporating temporal information reduces multiple sources of noise when estimating selection coefficients.
{"title":"Selection Estimation from Genetic Time-Series Data: Effects of Limited Sampling and Genetic Drift.","authors":"Qingbei Cheng, Muhammad Saqib Sohail, Matthew R McKay","doi":"10.1093/molbev/msaf301","DOIUrl":"10.1093/molbev/msaf301","url":null,"abstract":"<p><p>Estimating selection from genetic time-series data is fundamental to understanding evolutionary dynamics. Accurate selection inference is confounded by multiple noise sources, including limited sampling of populations and genetic drift. To characterize how these uncertainties collectively affect estimator performance, we analyze a mathematically tractable selection coefficient estimator derived under the marginal path likelihood (MPL) framework. We identify a parameter, the integrated mutant allele variance, as a key quantity determining estimator precision. Our analysis reveals that variance integration mitigates sampling and genetic drift errors at different rates, with drift typically becoming the dominant source of error in longer trajectories. The increased robustness of MPL-based estimation to sampling is surprising, since MPL is derived from a model that neglects this effect. Our findings offer insights into how incorporating temporal information reduces multiple sources of noise when estimating selection coefficients.</p>","PeriodicalId":18730,"journal":{"name":"Molecular biology and evolution","volume":" ","pages":""},"PeriodicalIF":5.3,"publicationDate":"2025-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145636160","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
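To make the role of the "integrated mutant allele variance" in the abstract above concrete, the following is a minimal single-locus sketch in the general style of MPL-type estimators: the net frequency change, corrected for mutation, divided by the time-integrated variance x(1 - x). The exact estimator analyzed in the paper may differ; the function names, the symmetric mutation term `mu`, and the regularization term `gamma` are assumptions for illustration.

```python
# Illustrative single-locus sketch; the estimator studied in the paper may differ.
# Assumed form: s_hat = (x_K - x_0 - mu * sum(dt * (1 - 2x))) / (sum(dt * x(1-x)) + gamma)

def integrated_variance(freqs, times):
    """Sum of dt * x * (1 - x) over the trajectory (left-endpoint rule)."""
    return sum((times[k + 1] - times[k]) * freqs[k] * (1.0 - freqs[k])
               for k in range(len(freqs) - 1))

def selection_estimate(freqs, times, mu=0.0, gamma=0.0):
    """Estimate a selection coefficient from a mutant allele frequency trajectory."""
    if len(freqs) != len(times) or len(freqs) < 2:
        raise ValueError("need at least two time points of equal length")
    # Expected frequency change from symmetric mutation alone.
    mutation_flux = sum((times[k + 1] - times[k]) * mu * (1.0 - 2.0 * freqs[k])
                        for k in range(len(freqs) - 1))
    denom = integrated_variance(freqs, times) + gamma
    if denom == 0.0:
        raise ValueError("zero integrated variance: trajectory is uninformative")
    return (freqs[-1] - freqs[0] - mutation_flux) / denom

def deterministic_trajectory(x0, s, generations):
    """Noise-free trajectory x_{t+1} = x_t + s * x_t * (1 - x_t), for checking."""
    freqs = [x0]
    for _ in range(generations):
        x = freqs[-1]
        freqs.append(x + s * x * (1.0 - x))
    return freqs

freqs = deterministic_trajectory(x0=0.1, s=0.05, generations=50)
times = list(range(len(freqs)))
s_hat = selection_estimate(freqs, times)
```

On this noise-free trajectory the estimate recovers the true coefficient up to floating-point error. With sampling noise or drift added, the error scales inversely with the integrated variance in the denominator, which is the quantity the abstract identifies as determining estimator precision.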