Poultry egg production is shaped by the intertwined action of multiple physiological systems, greatly magnifying the complexity of its underlying genetic regulation. Although multitissue mapping of regulatory variants offers a powerful route to untangle this complexity, comprehensive data sets in ducks remain scarce. Meanwhile, the contributions of peripheral systems beyond neuroendocrine regulation on poultry egg production are still largely unexplored. Here, we generate 979 RNA-seq samples from the liver, ovary, oviduct shell gland, and spleen, along with matched whole-genome sequencing data from 307 egg-laying ducks. We map cis-regulatory variants associated with gene expression (eQTL), alternative splicing (sQTL), and 3′ alternative polyadenylation (apaQTL), yielding 14,074, 6267, and 4994 genes with at least one significant eQTL, sQTL, and apaQTL, respectively. By integrating this resource and GWAS results, we confirm that ABCG2 expression in the shell gland specifically regulates eggshell color, with additional involvement of ENOPH1’s 3′APA sites in both the shell gland and liver. In addition, expression of LOC101800576 and LOC101790890 in the shell gland, of LOC119713219 in the ovary, and of GLP2R in the spleen is causally linked to declining egg production at peak laying. Last, we delineate a cross-tissue regulatory landscape underlying duck egg production and identify liver-derived modules, particularly Liver_ME1, which is mainly involved in cell cycle regulation, as central hubs coordinating with peripheral tissues affecting duck egg production. This work delivers a key resource and fresh perspectives for the genetic mechanism dissection of duck egg production and for future studies on cross-tissue regulation of reproduction.
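The association test at the core of cis-eQTL mapping can be illustrated with a minimal sketch. It assumes an additive model in which expression of one gene is regressed on the alternate-allele dosage (0, 1, or 2) at one nearby variant. The abstract does not name the software or covariates used, so the function name `eqtl_effect` and the toy data are hypothetical, and a real analysis would also adjust for covariates and multiple testing.

```python
import math


def eqtl_effect(dosages, expression):
    """Return (slope, Pearson r) for expression regressed on genotype dosage.

    Illustrative only: a plain least-squares fit for one gene-variant pair,
    with no covariates and no multiple-testing correction.
    """
    n = len(dosages)
    if n != len(expression) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    if sxx == 0 or syy == 0:
        raise ValueError("no variation in genotype or expression")
    slope = sxy / sxx                  # expression change per extra allele copy
    r = sxy / math.sqrt(sxx * syy)     # strength of the linear association
    return slope, r


# Hypothetical toy data: six birds, two per genotype class.
slope, r = eqtl_effect([0, 0, 1, 1, 2, 2], [1.0, 1.2, 2.0, 2.2, 3.0, 3.2])
print(f"slope={slope:.2f}, r={r:.3f}")
```

The same regression applies to sQTL and apaQTL mapping if the expression value is swapped for a splicing ratio or a polyadenylation-site usage ratio.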
{"title":"Mapping multitissue regulatory variants reveals a liver-centric coexpression network associated with duck egg-laying performance","authors":"Yang Xi, Jingjing Qi, Zhao Yang, Yutian Zeng, Huicong Zhang, Qiuyu Tao, Mengru Xu, Anqi Huang, Shenqiang Hu, Chunchun Han, Lili Bai, Jiwei Hu, Jiwen Wang, Liang Li, Lingzhao Fang, Hehe Liu","doi":"10.1101/gr.280345.124","DOIUrl":"https://doi.org/10.1101/gr.280345.124","url":null,"abstract":"Poultry egg production is shaped by the intertwined action of multiple physiological systems, greatly magnifying the complexity of its underlying genetic regulation. Although multitissue mapping of regulatory variants offers a powerful route to untangle this complexity, comprehensive data sets in ducks remain scarce. Meanwhile, the contributions of peripheral systems beyond neuroendocrine regulation on poultry egg production are still largely unexplored. Here, we generate 979 RNA-seq samples from the liver, ovary, oviduct shell gland, and spleen, along with matched whole-genome sequencing data from 307 egg-laying ducks. We map <em>cis</em>-regulatory variants associated with gene expression (eQTL), alternative splicing (sQTL), and 3′ alternative polyadenylation (apaQTL), yielding 14,074, 6267, and 4994 genes with at least one significant eQTL, sQTL, and apaQTL, respectively. By integrating this resource and GWAS results, we confirm that <em>ABCG2</em> expression in the shell gland specifically regulates eggshell color, with additional involvement of <em>ENOPH1</em>’s 3′APA sites in both the shell gland and liver. In addition, expression of <em>LOC101800576</em> and <em>LOC101790890</em> in the shell gland, of <em>LOC119713219</em> in the ovary, and of <em>GLP2R</em> in the spleen is causally linked to declining egg production at peak laying. 
Last, we delineate a cross-tissue regulatory landscape underlying duck egg production and identify liver-derived modules, particularly Liver_ME1, which is mainly involved in cell cycle regulation, as central hubs coordinating with peripheral tissues affecting duck egg production. This work delivers a key resource and fresh perspectives for the genetic mechanism dissection of duck egg production and for future studies on cross-tissue regulation of reproduction.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"12 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2025-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145031923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In a multicellular organism, cell lineages share a common evolutionary history. Knowing this history can facilitate the study of development, aging, and cancer. Cell lineage trees represent the evolutionary history of cells sampled from an organism. Recent developments in single-cell sequencing have greatly facilitated the inference of cell lineage trees. However, single-cell data are sparse and noisy, and the size of single-cell data is increasing rapidly. Accurate inference of cell lineage trees from large single-cell data is computationally challenging. In this paper, we present ScisTree2, a fast and accurate cell lineage tree inference and genotype calling approach based on the infinite-sites model. ScisTree2 relies on an efficient local search approach to find optimal trees. ScisTree2 also calls single-cell genotypes based on the inferred cell lineage tree. Experiments on simulated and real biological data show that ScisTree2 achieves better overall accuracy while being significantly more efficient than existing methods. To the best of our knowledge, ScisTree2 is the first model-based cell lineage tree inference and genotype calling approach that is capable of handling datasets from tens of thousands of cells or more.
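The infinite-sites model mentioned above puts a simple constraint on error-free genotypes: each mutation arises once and is never lost. For binary genotypes with a known ancestral state of 0, this means no pair of sites may show all three patterns (1,1), (1,0), and (0,1) across cells. The sketch below checks that condition. It is not ScisTree2's algorithm, and the function name and toy matrices are hypothetical.

```python
from itertools import combinations


def infinite_sites_conflicts(genotypes):
    """Return the pairs of sites that cannot fit on one rooted tree.

    genotypes: list of rows (one per cell), each a list of 0/1 values
    (one per site), with 0 taken as the ancestral state.
    """
    n_sites = len(genotypes[0]) if genotypes else 0
    forbidden = {(1, 1), (1, 0), (0, 1)}
    conflicts = []
    for i, j in combinations(range(n_sites), 2):
        patterns = {(row[i], row[j]) for row in genotypes}
        if forbidden <= patterns:      # all three patterns are present
            conflicts.append((i, j))
    return conflicts


# Three cells, three sites: consistent with a single tree.
clean = [[1, 1, 0],
         [1, 0, 0],
         [0, 0, 1]]
# Three cells, two sites: the two mutations overlap without nesting.
noisy = [[1, 1],
         [1, 0],
         [0, 1]]
print(infinite_sites_conflicts(clean))
print(infinite_sites_conflicts(noisy))
```

Noisy single-cell data usually contain such conflicts, which is why a model-based method has to search for the tree and genotypes that best explain the observations rather than read the tree off the matrix.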
{"title":"ScisTree2 enables large-scale inference of cell lineage trees and genotype calling using efficient local search","authors":"Haotian Zhang, Yiming Zhang, Teng Gao, Yufeng Wu","doi":"10.1101/gr.280542.125","DOIUrl":"https://doi.org/10.1101/gr.280542.125","url":null,"abstract":"In a multicellular organism, cell lineages share a common evolutionary history. Knowing this history can facilitate the study of development, aging, and cancer. Cell lineage trees represent the evolutionary history of cells sampled from an organism. Recent developments in single-cell sequencing have greatly facilitated the inference of cell lineage trees. However, single-cell data are sparse and noisy, and the size of single-cell data is increasing rapidly. Accurate inference of cell lineage trees from large single-cell data is computationally challenging. In this paper, we present ScisTree2, a fast and accurate cell lineage tree inference and genotype calling approach based on the infinite-sites model. ScisTree2 relies on an efficient local search approach to find optimal trees. ScisTree2 also calls single-cell genotypes based on the inferred cell lineage tree. Experiments on simulated and real biological data show that ScisTree2 achieves better overall accuracy while being significantly more efficient than existing methods. 
To the best of our knowledge, ScisTree2 is the first model-based cell lineage tree inference and genotype calling approach that is capable of handling datasets from tens of thousands of cells or more.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"24 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2025-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144987594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yohan An, Ji-Hyun Lee, Joonoh Lim, Jeonghwan Youk, Seongyeol Park, Ji-Hyung Park, Kijong Yi, Taewoo Kim, Chang Hyun Nam, Won Hee Lee, Soo A Oh, Yoo Jin Bae, Thomas M. Klompstra, Haeun Lee, Jinju Han, Junehwak Lee, Jung Woo Park, Jie-Hyun Kim, Hyunki Kim, Hugo Snippert, Bon-Kyoung Koo, Young Seok Ju
Cancer genomes frequently carry APOBEC (apolipoprotein B mRNA editing catalytic polypeptide-like)-associated DNA mutations, suggesting APOBEC enzymes as innate mutagens during cancer initiation and evolution. However, the pure mutagenic impacts of the specific enzymes among this family remain unclear in human normal cell lineages. Here, we investigated the comparative mutagenic activities of APOBEC3A and APOBEC3B, through whole-genome sequencing of human normal gastric organoid lines carrying doxycycline-inducible APOBEC expression cassettes. Our findings demonstrated that transcriptional upregulation of APOBEC3A led to the acquisition of a massive number of genomic mutations in just a few cell cycles. By contrast, despite clear deaminase activity and DNA damage, APOBEC3B upregulation did not generate a significant increase in mutations in the gastric epithelium. APOBEC3B-associated mutagenesis remained minimal even in the context of TP53 inactivation. Further analysis of the mutational landscape following APOBEC3A upregulation revealed a detailed spectrum of APOBEC3A-associated mutations, including indels, primarily 1 bp deletions, clustered mutations, and evidence of selective pressures acting on cells carrying the mutations. Our observations provide a clear foundation for understanding the mutational impact of APOBEC enzymes in human cells.
{"title":"APOBEC3A drives deaminase mutagenesis in human gastric epithelium","authors":"Yohan An, Ji-Hyun Lee, Joonoh Lim, Jeonghwan Youk, Seongyeol Park, Ji-Hyung Park, Kijong Yi, Taewoo Kim, Chang Hyun Nam, Won Hee Lee, Soo A Oh, Yoo Jin Bae, Thomas M. Klompstra, Haeun Lee, Jinju Han, Junehwak Lee, Jung Woo Park, Jie-Hyun Kim, Hyunki Kim, Hugo Snippert, Bon-Kyoung Koo, Young Seok Ju","doi":"10.1101/gr.280338.124","DOIUrl":"https://doi.org/10.1101/gr.280338.124","url":null,"abstract":"Cancer genomes frequently carry APOBEC (apolipoprotein B mRNA editing catalytic polypeptide-like)-associated DNA mutations, suggesting APOBEC enzymes as innate mutagens during cancer initiation and evolution. However, the pure mutagenic impacts of the specific enzymes among this family remain unclear in human normal cell lineages. Here, we investigated the comparative mutagenic activities of <em>APOBEC3A</em> and <em>APOBEC3B</em>, through whole-genome sequencing of human normal gastric organoid lines carrying doxycycline-inducible APOBEC expression cassettes. Our findings demonstrated that transcriptional upregulation of <em>APOBEC3A</em> led to the acquisition of a massive number of genomic mutations in just a few cell cycles. By contrast, despite clear deaminase activity and DNA damage, <em>APOBEC3B</em> upregulation did not generate a significant increase in mutations in the gastric epithelium. <em>APOBEC3B</em>-associated mutagenesis remained minimal even in the context of TP53 inactivation. Further analysis of the mutational landscape following <em>APOBEC3A</em> upregulation revealed a detailed spectrum of <em>APOBEC3A</em>-associated mutations, including indels, primarily 1 bp deletions, clustered mutations, and evidence of selective pressures acting on cells carrying the mutations. 
Our observations provide a clear foundation for understanding the mutational impact of APOBEC enzymes in human cells.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"15 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2025-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144898382","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Microglia-driven dysregulation has emerged as a significant underlying mechanism in many neurodegenerative diseases, such as Age-related Macular Degeneration (AMD) and Alzheimer's disease (AD). While both brain and retinal microglia originate from the yolk sac, it is uncertain whether they share molecular similarities or genetic and molecular foundations related to neurodegenerative diseases. In this study, we examine the transcriptomic and epigenetic profiles of retina and brain microglia through integrative analyses of single-nucleus RNA sequencing (snRNA-seq) and single-nucleus ATAC sequencing (snATAC-seq) from 97 independent human samples across eleven different studies. Our findings reveal that retina and brain microglia share similar expression and regulatory profiles when compared to other cell types in retina and brain. By integrating genome-wide association studies (GWAS) data with gene expression profiles, we demonstrate that genetic variants associated with AMD and AD are linked to microglia-specific gene signatures. Furthermore, integrating regulatory annotations with GWAS data shows that susceptibility loci for both AMD and AD are notably enriched in the open chromatin regions of microglia from brain and retina, emphasizing their relevance to these neurodegenerative conditions. Finally, a comparison with microglia annotations from other tissues highlights the specific enrichment of microglia in relation to neurodegenerative diseases. These findings contribute to the understanding of the role of microglia in AMD and AD pathogenesis and offer an opportunity to utilize resources from both retinal and brain microglia to deepen our understanding of their contributions to genetic variations in neurodegenerative diseases.
{"title":"Molecular and genetic landscapes of retina and brain microglia in neurodegenerative diseases","authors":"Khang Ma, Rinki Ratnapriya","doi":"10.1101/gr.280554.125","DOIUrl":"https://doi.org/10.1101/gr.280554.125","url":null,"abstract":"Microglia-driven dysregulation has emerged as a significant underlying mechanism in many neurodegenerative diseases, such as Age-related Macular Degeneration (AMD) and Alzheimer's disease (AD). While both brain and retinal microglia originate from the yolk sac, it is uncertain whether they share molecular similarities or genetic and molecular foundations related to neurodegenerative diseases. In this study, we examine the transcriptomic and epigenetic profiles of retina and brain microglia through integrative analyses of single-nucleus RNA sequencing (snRNA-seq) and single-nucleus ATAC sequencing (snATAC-seq) from 97 independent human samples across eleven different studies. Our findings reveal that retina and brain microglia share similar expression and regulatory profiles when compared to other cell types in retina and brain. By integrating genome-wide association studies (GWAS) data with gene expression profiles, we demonstrate that genetic variants associated with AMD and AD are linked to microglia-specific gene signatures. Furthermore, integrating regulatory annotations with GWAS data shows that susceptibility loci for both AMD and AD are notably enriched in the open chromatin regions of microglia from brain and retina, emphasizing their relevance to these neurodegenerative conditions. Finally, a comparison with microglia annotations from other tissues highlights the specific enrichment of microglia in relation to neurodegenerative diseases. 
These findings contribute to the understanding of the role of microglia in AMD and AD pathogenesis and offer an opportunity to utilize resources from both retinal and brain microglia to deepen our understanding of their contributions to genetic variations in neurodegenerative diseases.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"43 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2025-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144898379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Isabela T Pereira, Izabela Mamede, Paulo de Paiva Amaral, Gloria Regina Franco, John L Rinn
Many essential cellular processes require RNA to interact with protein(s) to form ribonucleoprotein complexes (RNPs). For example, all cellular proteins are produced by the ribosome, a large and stable RNP; gene splicing requires a choreography of numerous small and large RNPs; and even the replication of telomeric DNA requires an RNP. All these examples are stable RNPs that exhibit specific sedimentation rates (e.g., in a sucrose gradient) based on the composition of RNA and protein. In this study, we aimed to identify RNA components of discrete RNPs on a transcriptome-wide scale. Using sucrose-gradient sedimentation followed by sequencing, we identified 1,057 RNA transcripts, both coding and noncoding, that are likely to be components of cellular RNPs. We named these transcripts Gradient Enriched Transcripts (GETs). GETs were predominantly nuclear and metabolically stable, and they were not the major splice isoforms; instead, each contained a specific retained intron. Collectively, our study reveals a widespread phenomenon of a specific intron being retained in stable nuclear RNPs.
{"title":"Widespread specific intron-retention events in nuclear RNA complexes identified by sedimentation analysis of pluripotent cellular extracts","authors":"Isabela T Pereira, Izabela Mamede, Paulo de Paiva Amaral, Gloria Regina Franco, John L Rinn","doi":"10.1101/gr.280431.125","DOIUrl":"https://doi.org/10.1101/gr.280431.125","url":null,"abstract":"Many essential cellular processes require RNA to interact with protein(s) to form ribonucleoprotein complexes (RNPs). For example, all cellular proteins are produced by the ribosome, a large and stable RNP; gene splicing requires a choreography of numerous small and large RNPs; and even the replication of telomeric DNA requires an RNP. All these examples are stable RNPs that exhibit specific sedimentation rates (e.g., in a sucrose gradient) based on the composition of RNA and protein. In this study, we aimed to identify RNA components of discrete RNPs on a transcriptome-wide scale. Using sucrose-gradient sedimentation followed by sequencing, we identified 1,057 RNA transcripts, both coding and noncoding, that are likely to be components of cellular RNPs. We named these transcripts Gradient Enriched Transcripts (GETs). GETs were predominantly nuclear and metabolically stable, and they were not the major splice isoforms; instead, each contained a specific retained intron. 
Collectively, our study reveals a widespread phenomenon of a specific intron being retained in stable nuclear RNPs.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"23 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2025-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144898383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nikol Chantzi, Akshatha Nayak, Fotis A. Baltoumas, Eleni Aplakidou, Shiau Wei Liew, Jesslyn Elvaretta Galuh, Michail Patsakis, Austin Montgomery, Camille Moeckel, Ioannis Mouratidis, Saiful Arefeen Sazed, Wilfried Guiblet, Panagiotis Karmiris-Obratański, Guliang Wang, Apostolos Zaravinos, Karen M. Vasquez, Chun Kit Kwok, Georgios A. Pavlopoulos, Ilias Georgakopoulos-Soares
G-quadruplex DNA structures exhibit a profound influence on essential biological processes, including transcription, replication, telomere maintenance, and genomic stability. These structures have demonstrably shaped organismal evolution. However, a comprehensive, organism-wide G-quadruplex map encompassing the diversity of life has remained elusive. Here, we introduce Quadrupia, the most extensive and well-characterized G-quadruplex database to date, facilitating the exploration of G-quadruplex structures across the evolutionary spectrum. Quadrupia has identified G-quadruplex sequences in 108,449 reference genomes, with a total of 140,181,277 G-quadruplexes. The database also hosts a collection of 319,784 G-quadruplex clusters of 20 or more members, annotated by taxonomic distributions, multiple sequence alignments, profile hidden Markov models and cross-references to G-quadruplex 3D structures. Examination of G-quadruplexes across functional genomic elements in different taxa indicates preferential orientation and positioning, with significant differences between individual taxonomic groups. For example, we find that G-quadruplexes in bacteria with a single replication origin display profound preference for the leading orientation. Finally, we experimentally validate the most frequently observed G-quadruplexes using CD-spectroscopy, UV melting, and fluorescent-based approaches.
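For readers unfamiliar with how putative G-quadruplexes are located in sequence, a widely used heuristic is the consensus motif of four runs of three or more guanines separated by loops of 1 to 7 bases. The sketch below applies that motif with a regular expression. The abstract does not state which detection method Quadrupia uses, so this is a general illustration and not the database's pipeline, and the function name is hypothetical.

```python
import re

# Consensus motif: G3+ N1-7 G3+ N1-7 G3+ N1-7 G3+ on the scanned strand.
G4_PATTERN = re.compile(
    r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}"
)


def find_g4_motifs(sequence):
    """Return (start, end) spans of non-overlapping motif matches.

    Only the given strand is scanned; to cover the opposite strand,
    pass in the reverse complement of the sequence.
    """
    return [(m.start(), m.end()) for m in G4_PATTERN.finditer(sequence.upper())]


# Four copies of the human telomeric repeat TTAGGG.
print(find_g4_motifs("TTAGGGTTAGGGTTAGGGTTAGGG"))
```

A motif match only predicts folding potential, which is one reason experimental validation, such as the CD spectroscopy and UV melting mentioned above, is still needed.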
{"title":"Quadrupia provides a comprehensive catalog of G-quadruplexes across genomes from the tree of life","authors":"Nikol Chantzi, Akshatha Nayak, Fotis A. Baltoumas, Eleni Aplakidou, Shiau Wei Liew, Jesslyn Elvaretta Galuh, Michail Patsakis, Austin Montgomery, Camille Moeckel, Ioannis Mouratidis, Saiful Arefeen Sazed, Wilfried Guiblet, Panagiotis Karmiris-Obratański, Guliang Wang, Apostolos Zaravinos, Karen M. Vasquez, Chun Kit Kwok, Georgios A. Pavlopoulos, Ilias Georgakopoulos-Soares","doi":"10.1101/gr.279790.124","DOIUrl":"https://doi.org/10.1101/gr.279790.124","url":null,"abstract":"G-quadruplex DNA structures exhibit a profound influence on essential biological processes, including transcription, replication, telomere maintenance, and genomic stability. These structures have demonstrably shaped organismal evolution. However, a comprehensive, organism-wide G-quadruplex map encompassing the diversity of life has remained elusive. Here, we introduce Quadrupia, the most extensive and well-characterized G-quadruplex database to date, facilitating the exploration of G-quadruplex structures across the evolutionary spectrum. Quadrupia has identified G-quadruplex sequences in 108,449 reference genomes, with a total of 140,181,277 G-quadruplexes. The database also hosts a collection of 319,784 G-quadruplex clusters of 20 or more members, annotated by taxonomic distributions, multiple sequence alignments, profile hidden Markov models and cross-references to G-quadruplex 3D structures. Examination of G-quadruplexes across functional genomic elements in different taxa indicates preferential orientation and positioning, with significant differences between individual taxonomic groups. For example, we find that G-quadruplexes in bacteria with a single replication origin display profound preference for the leading orientation. 
Finally, we experimentally validate the most frequently observed G-quadruplexes using CD-spectroscopy, UV melting, and fluorescent-based approaches.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"191 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2025-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144898384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Taxonomic sequence classification is a computational problem central to the study of metagenomics and evolution. Advances in compressed indexing with the r-index enable full-text pattern matching against large sequence collections. But the data structures that link pattern sequences to their clades of origin still do not scale well to large collections. Previous work proposed document array profiles, which use O(rd) words of space, where r is the number of maximal equal-letter runs in the Burrows-Wheeler transform and d is the number of distinct genomes. The linear dependence on d is limiting, since real taxonomies can easily contain tens of thousands of leaves or more. We propose a method called cliff compression that reduces this size by a large factor, over 250× when indexing the SILVA 16S rRNA gene database. This method uses Θ(r log d) words of space in expectation under a random model we propose here. We implemented these ideas in an open-source tool called Cliffy that performs efficient taxonomic classification of sequencing reads with respect to a compressed taxonomic index. When applied to simulated 16S rRNA reads, Cliffy's read-level accuracy is higher than Kraken2's by 11-18%. Clade abundances are also more accurately predicted by Cliffy compared to Kraken2 and Bracken. Overall, Cliffy is a fast and space-economical extension to compressed full-text indexes, enabling them to perform fast and accurate taxonomic classification queries. Cliffy's accuracy underscores the advantages of full-text indexes, which offer a more precise solution compared to k-mer indexes designed for a specific k value.
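The quantity r in the space bounds above is the number of maximal equal-letter runs in the Burrows-Wheeler transform (BWT). The sketch below computes it on a toy string, using naive rotation sorting for clarity in place of the suffix-array construction a real index would use. It then compares the two growth terms, r·d and r·log2 d, with constant factors ignored. The function names are hypothetical and none of this reflects Cliffy's actual implementation.

```python
import math
from itertools import groupby


def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (toy-sized inputs only)."""
    if "$" in text:
        raise ValueError("'$' is reserved as the end-of-text sentinel")
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s):
    """Number of maximal runs of equal letters in s."""
    return sum(1 for _ in groupby(s))


transformed = bwt("banana")
r = count_runs(transformed)
print(transformed, r)

# Growth terms only; the bounds' hidden constants are ignored here.
d = 10_000
print("r * d       =", r * d)
print("r * log2(d) =", round(r * math.log2(d), 1))
```

Repetitive collections, such as many related 16S sequences, tend to give long BWT runs and hence a small r, which is what makes run-length-based indexes compact.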
{"title":"Robust 16S rRNA classification based on a compressed LCA index","authors":"Omar Y. Ahmed, Christina Boucher, Ben Langmead","doi":"10.1101/gr.279846.124","DOIUrl":"https://doi.org/10.1101/gr.279846.124","url":null,"abstract":"Taxonomic sequence classification is a computational problem central to the study of metagenomics and evolution. Advances in compressed indexing with the <em>r</em>-index enable full-text pattern matching against large sequence collections. But the data structures that link pattern sequences to their clades of origin still do not scale well to large collections. Previous work proposed document array profiles, which use <em>O</em>(<em>rd</em>) words of space, where <em>r</em> is the number of maximal equal-letter runs in the Burrows-Wheeler transform and <em>d</em> is the number of distinct genomes. The linear dependence on <em>d</em> is limiting, since real taxonomies can easily contain tens of thousands of leaves or more. We propose a method called cliff compression that reduces this size by a large factor, over 250× when indexing the SILVA 16S rRNA gene database. This method uses Θ(<em>r</em> log <em>d</em>) words of space in expectation under a random model we propose here. We implemented these ideas in an open-source tool called Cliffy that performs efficient taxonomic classification of sequencing reads with respect to a compressed taxonomic index. When applied to simulated 16S rRNA reads, Cliffy's read-level accuracy is higher than Kraken2's by 11-18%. Clade abundances are also more accurately predicted by Cliffy compared to Kraken2 and Bracken. Overall, Cliffy is a fast and space-economical extension to compressed full-text indexes, enabling them to perform fast and accurate taxonomic classification queries. 
Cliffy's accuracy underscores the advantages of full-text indexes, which offer a more precise solution compared to <em>k</em>-mer indexes designed for a specific <em>k</em> value.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"10 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144898389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Noor P Singh, Euphy Wu, Jason Fan, Michael I Love, Rob Patro
Identifying differentially expressed transcripts poses a crucial yet challenging problem in transcriptomics. Substantial uncertainty is associated with the abundance estimates of certain transcripts which, if ignored, can lead to the exaggeration of false positives and, if included, may lead to reduced power. Given a set of RNA-seq samples, TreeTerminus arranges transcripts in a hierarchical tree structure that encodes different layers of resolution for interpretation of the abundance of transcriptional groups, with uncertainty generally decreasing as one ascends the tree from the leaves. We introduce mehenDi, which utilizes the tree structure from TreeTerminus for differential testing. The nodes output by mehenDi, called the selected nodes, are determined in a data-driven manner to maximize the signal that can be extracted from the data while controlling for the uncertainty associated with estimating the transcript abundances. The identified selected nodes can include transcripts and inner nodes, with no two nodes having an ancestor/descendant relationship. We evaluated our method on both simulated and experimental datasets, comparing its performance with other tree-based differential methods as well as with uncertainty-aware differential transcript/gene expression methods. Our method detects inner nodes that show a strong signal for differential expression, which would have been overlooked when analyzing the transcripts alone.
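The constraint that no two selected nodes have an ancestor/descendant relationship can be stated compactly in code. The sketch below only checks that property for a candidate set on a small tree. It does not reproduce how mehenDi chooses nodes, and the child-to-parent dictionary, the function names, and the toy tree are hypothetical.

```python
def ancestors(node, parent):
    """Yield every proper ancestor of node, walking up to the root."""
    while node in parent:
        node = parent[node]
        yield node


def is_valid_selection(selected, parent):
    """True if no selected node is an ancestor of another selected node."""
    chosen = set(selected)
    return not any(a in chosen for node in chosen for a in ancestors(node, parent))


# Toy tree: root -> g1 -> {t1, t2}, root -> g2 -> {t3}; leaves are transcripts.
parent = {"g1": "root", "g2": "root", "t1": "g1", "t2": "g1", "t3": "g2"}

print(is_valid_selection({"g1", "t3"}, parent))  # inner node plus unrelated leaf
print(is_valid_selection({"g1", "t1"}, parent))  # t1 sits beneath g1
```

Under this constraint each transcript is covered by at most one reported node, so a reported inner node stands in for the whole group beneath it.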
{"title":"Tree-based differential testing using inferential uncertainty for RNA-seq","authors":"Noor P Singh, Euphy Wu, Jason Fan, Michael I Love, Rob Patro","doi":"10.1101/gr.279981.124","DOIUrl":"https://doi.org/10.1101/gr.279981.124","url":null,"abstract":"Identifying differentially expressed transcripts poses a crucial yet challenging problem in transcriptomics. Substantial uncertainty is associated with the abundance estimates of certain transcripts which, if ignored, can lead to the exaggeration of false positives and, if included, may lead to reduced power. Given a set of RNA-seq samples, TreeTerminus arranges transcripts in a hierarchical tree structure that encodes different layers of resolution for interpretation of the abundance of transcriptional groups, with uncertainty generally decreasing as one ascends the tree from the leaves. We introduce mehenDi, which utilizes the tree structure from TreeTerminus for differential testing. The nodes output by mehenDi, called the selected nodes, are determined in a data-driven manner to maximize the signal that can be extracted from the data while controlling for the uncertainty associated with estimating the transcript abundances. The identified selected nodes can include transcripts and inner nodes, with no two nodes having an ancestor/descendant relationship. We evaluated our method on both simulated and experimental datasets, comparing its performance with other tree-based differential methods as well as with uncertainty-aware differential transcript/gene expression methods. 
Our method detects inner nodes that show a strong signal for differential expression, which would have been overlooked when analyzing the transcripts alone.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"9 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2025-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144898434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tandem repeats (TRs) are sequences of DNA where two or more base pairs are repeated back-to-back at specific locations in the genome. TR expansions, where the number of repeat units exceeds the normal range, have been implicated in over 50 conditions. However, accurately measuring the copy number of TRs is challenging, especially when their expansions are larger than the fragment sizes used in standard short-read genome sequencing. Here, we introduce ScatTR, a novel computational method that leverages a maximum likelihood framework to estimate the copy number of large TR expansions from short-read sequencing data. ScatTR calculates the likelihood of different alignments between sequencing reads and reference sequences that represent various TR lengths and employs a Monte Carlo technique to find the best match. In simulated data, ScatTR outperforms state-of-the-art methods, particularly for TRs with longer motifs and those with lengths that greatly exceed typical sequencing fragment sizes. When applied to data from the 1000 Genomes Project, ScatTR detects potential large TR expansions that other methods missed, highlighting its ability to better characterize genome-wide TR variation.
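The idea of a maximum likelihood estimate of copy number can be shown with a deliberately simplified model. The sketch below assumes that the number of reads falling inside the repeat is Poisson-distributed with mean depth × repeat length / read length, and it scans candidate copy numbers for the best fit. This swaps in a plain depth-based likelihood and a grid search for ScatTR's alignment likelihood and Monte Carlo search, so it is not the published method. All names and numbers are hypothetical.

```python
import math


def poisson_log_likelihood(count, mean):
    """Log-probability of observing `count` events under Poisson(mean)."""
    return count * math.log(mean) - mean - math.lgamma(count + 1)


def estimate_copy_number(repeat_reads, motif_len, read_len, depth, max_copies=5000):
    """Grid-search the copy number that best explains the repeat read count.

    Simplifying assumption: expected reads in the repeat equals
    depth * (copies * motif_len) / read_len, with edge effects ignored.
    """
    def log_likelihood(copies):
        expected_reads = depth * copies * motif_len / read_len
        return poisson_log_likelihood(repeat_reads, expected_reads)

    return max(range(1, max_copies + 1), key=log_likelihood)


# Hypothetical locus: 6 bp motif, 150 bp reads, 30x depth, 600 in-repeat reads.
print(estimate_copy_number(repeat_reads=600, motif_len=6, read_len=150, depth=30))
```

With these numbers the estimated expansion spans 3,000 bp, far longer than a typical short-read fragment, which is the regime the abstract describes as hardest to measure.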
{"title":"Estimating the size of long tandem repeat expansions from short reads with ScatTR","authors":"Rashid Al-Abri, Gamze Gursoy","doi":"10.1101/gr.280563.125","DOIUrl":"https://doi.org/10.1101/gr.280563.125","url":null,"abstract":"Tandem repeats (TRs) are sequences of DNA where two or more base pairs are repeated back-to-back at specific locations in the genome. TR expansions, where the number of repeat units exceeds the normal range, have been implicated in over 50 conditions. However, accurately measuring the copy number of TRs is challenging, especially when their expansions are larger than the fragment sizes used in standard short-read genome sequencing. Here, we introduce ScatTR, a novel computational method that leverages a maximum likelihood framework to estimate the copy number of large TR expansions from short-read sequencing data. ScatTR calculates the likelihood of different alignments between sequencing reads and reference sequences that represent various TR lengths and employs a Monte Carlo technique to find the best match. In simulated data, ScatTR outperforms state-of-the-art methods, particularly for TRs with longer motifs and those with lengths that greatly exceed typical sequencing fragment sizes. 
When applied to data from the 1000 Genomes Project, ScatTR detects potential large TR expansions that other methods missed, highlighting its ability to better characterize genome-wide TR variation.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"146 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2025-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144898436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Functional gene programs play a wide range of roles in health and disease by orchestrating transcriptional coregulation to govern cell identity. Understanding these intricate gene programs is essential for unraveling the complexities of biological systems; however, deciphering them remains a significant challenge. Recent advancements in single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics (ST) technologies have empowered the comprehensive characterization of gene programs at both single-cell and spatial resolutions. Here, we present DeCEP, a computational framework designed to characterize context-specific gene programs using scRNA-seq and ST data. DeCEP leverages functional gene lists and directed graphs to construct functional networks underlying distinct cellular or spatial contexts. It then identifies context-dependent hub genes associated with specific gene programs based on network topology and assigns gene program activity to individual cells or spatial locations. Through evaluation on both simulated and real biological datasets, DeCEP demonstrates complementary strengths over existing methods by enabling more fine-grained characterization of gene programs within specific contexts, particularly those characterized by pronounced transcriptional heterogeneity. Furthermore, we showcase the ability of DeCEP in elucidating biological insights through case studies on normal liver tissue, Alzheimer's disease, and cancer.
{"title":"Deciphering context-specific gene programs from single-cell and spatial transcriptomics data with DeCEP","authors":"Lin Li, Xianbin Su, Ze-Guang Han","doi":"10.1101/gr.279689.124","DOIUrl":"https://doi.org/10.1101/gr.279689.124","url":null,"abstract":"Functional gene programs play a wide range of roles in health and disease by orchestrating transcriptional coregulation to govern cell identity. Understanding these intricate gene programs is essential for unraveling the complexities of biological systems; however, deciphering them remains a significant challenge. Recent advancements in single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics (ST) technologies have empowered the comprehensive characterization of gene programs at both single-cell and spatial resolutions. Here, we present DeCEP, a computational framework designed to characterize context-specific gene programs using scRNA-seq and ST data. DeCEP leverages functional gene lists and directed graphs to construct functional networks underlying distinct cellular or spatial contexts. It then identifies context-dependent hub genes associated with specific gene programs based on network topology and assigns gene program activity to individual cells or spatial locations. Through evaluation on both simulated and real biological datasets, DeCEP demonstrates complementary strengths over existing methods by enabling more fine-grained characterization of gene programs within specific contexts, particularly those characterized by pronounced transcriptional heterogeneity. 
Furthermore, we showcase the ability of DeCEP in elucidating biological insights through case studies on normal liver tissue, Alzheimer's disease, and cancer.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"38 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2025-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144898437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}