Tandem repeats (TRs) are sequences of DNA where two or more base pairs are repeated back-to-back at specific locations in the genome. TR expansions, where the number of repeat units exceeds the normal range, have been implicated in over 50 conditions. However, accurately measuring the copy number of TRs is challenging, especially when their expansions are larger than the fragment sizes used in standard short-read genome sequencing. Here, we introduce ScatTR, a novel computational method that leverages a maximum likelihood framework to estimate the copy number of large TR expansions from short-read sequencing data. ScatTR calculates the likelihood of different alignments between sequencing reads and reference sequences that represent various TR lengths and employs a Monte Carlo technique to find the best match. In simulated data, ScatTR outperforms state-of-the-art methods, particularly for TRs with longer motifs and those with lengths that greatly exceed typical sequencing fragment sizes. When applied to data from the 1000 Genomes Project, ScatTR detects potential large TR expansions that other methods missed, highlighting its ability to better characterize genome-wide TR variation.
{"title":"Estimating the size of long tandem repeat expansions from short reads with ScatTR","authors":"Rashid Al-Abri, Gamze Gursoy","doi":"10.1101/gr.280563.125","DOIUrl":"https://doi.org/10.1101/gr.280563.125","url":null,"abstract":"Tandem repeats (TRs) are sequences of DNA where two or more base pairs are repeated back-to-back at specific locations in the genome. TR expansions, where the number of repeat units exceeds the normal range, have been implicated in over 50 conditions. However, accurately measuring the copy number of TRs is challenging, especially when their expansions are larger than the fragment sizes used in standard short-read genome sequencing. Here, we introduce ScatTR, a novel computational method that leverages a maximum likelihood framework to estimate the copy number of large TR expansions from short-read sequencing data. ScatTR calculates the likelihood of different alignments between sequencing reads and reference sequences that represent various TR lengths and employs a Monte Carlo technique to find the best match. In simulated data, ScatTR outperforms state-of-the-art methods, particularly for TRs with longer motifs and those with lengths that greatly exceed typical sequencing fragment sizes. 
When applied to data from the 1000 Genomes Project, ScatTR detects potential large TR expansions that other methods missed, highlighting its ability to better characterize genome-wide TR variation.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"146 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2025-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144898436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Functional gene programs play a wide range of roles in health and disease by orchestrating transcriptional coregulation to govern cell identity. Understanding these intricate gene programs is essential for unraveling the complexities of biological systems; however, deciphering them remains a significant challenge. Recent advancements in single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics (ST) technologies have empowered the comprehensive characterization of gene programs at both single-cell and spatial resolutions. Here, we present DeCEP, a computational framework designed to characterize context-specific gene programs using scRNA-seq and ST data. DeCEP leverages functional gene lists and directed graphs to construct functional networks underlying distinct cellular or spatial contexts. It then identifies context-dependent hub genes associated with specific gene programs based on network topology and assigns gene program activity to individual cells or spatial locations. Through evaluation on both simulated and real biological datasets, DeCEP demonstrates complementary strengths over existing methods by enabling more fine-grained characterization of gene programs within specific contexts, particularly those characterized by pronounced transcriptional heterogeneity. Furthermore, we showcase DeCEP's ability to elucidate biological insights through case studies on normal liver tissue, Alzheimer's disease, and cancer.
{"title":"Deciphering context-specific gene programs from single-cell and spatial transcriptomics data with DeCEP","authors":"Lin Li, Xianbin Su, Ze-Guang Han","doi":"10.1101/gr.279689.124","DOIUrl":"https://doi.org/10.1101/gr.279689.124","url":null,"abstract":"Functional gene programs play a wide range of roles in health and disease by orchestrating transcriptional coregulation to govern cell identity. Understanding these intricate gene programs is essential for unraveling the complexities of biological systems; however, deciphering them remains a significant challenge. Recent advancements in single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics (ST) technologies have empowered the comprehensive characterization of gene programs at both single-cell and spatial resolutions. Here, we present DeCEP, a computational framework designed to characterize context-specific gene programs using scRNA-seq and ST data. DeCEP leverages functional gene lists and directed graphs to construct functional networks underlying distinct cellular or spatial contexts. It then identifies context-dependent hub genes associated with specific gene programs based on network topology and assigns gene program activity to individual cells or spatial locations. Through evaluation on both simulated and real biological datasets, DeCEP demonstrates complementary strengths over existing methods by enabling more fine-grained characterization of gene programs within specific contexts, particularly those characterized by pronounced transcriptional heterogeneity. 
Furthermore, we showcase the ability of DeCEP in elucidating biological insights through case studies on normal liver tissue, Alzheimer' disease, and cancer.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"38 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2025-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144898437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A tandem repeat is a sequence of nucleotides that appears as multiple contiguous, near-identical copies arranged consecutively. Tandem repeats are widespread across natural genomes, play critical roles in genetic diversity and gene regulation, and are associated with various neurological and developmental disorders. They can also arise in sequencing reads generated by certain technologies, such as those used for sequencing circular molecules. A key challenge in analyzing tandem repeats is reconstructing the sequence of the underlying repeat unit. While several methods exist, they often exhibit low accuracy when the repeat unit length increases or the number of copies is low. Furthermore, methods capable of handling highly mutated sequences remain scarce, highlighting a significant opportunity for improvement. We introduce EquiRep, a tool for accurate detection of tandem repeats from erroneous sequences. EquiRep estimates the likelihood of positions originating from the same location in the unit through self-alignment, followed by a novel refinement approach. The resulting equivalence classes and consecutive position information are then used to build a weighted graph. A cycle in this graph with maximum bottleneck weight covering most nucleotide positions is identified to reconstruct the repeat unit. We test EquiRep on two applications, identifying repeat units from satellite DNAs and reconstructing circular RNAs from rolling-circular long-read sequencing data, using both simulated and raw sequencing datasets. Our results show that EquiRep consistently outperforms or matches state-of-the-art methods, demonstrating robustness to sequencing errors and superior performance on long repeat units and low-frequency repeats. These capabilities underscore EquiRep’s broad utility in tandem repeat analysis.
{"title":"Accurate detection of tandem repeats from error-prone sequences with EquiRep","authors":"Zhezheng Song, Tasfia Zahin, Xiang Li, Mingfu Shao","doi":"10.1101/gr.280750.125","DOIUrl":"https://doi.org/10.1101/gr.280750.125","url":null,"abstract":"A tandem repeat is a sequence of nucleotides that appear as multiple contiguous, near-identical copies arranged consecutively. Tandem repeats are widespread across natural genomes, play critical roles in genetic diversity, gene regulation, and are associated with various neurological and developmental disorders. They can also arise in sequencing reads generated by certain technologies, such as those used for sequencing circular molecules. A key challenge in analyzing tandem repeats is reconstructing the sequence of the underlying repeat unit. While several methods exist, they often exhibit low accuracy when the repeat unit length increases or the number of copies is low. Furthermore, methods capable of handling highly mutated sequences remain scarce, highlighting a significant opportunity for improvement. We introduce EquiRep, a tool for accurate detection of tandem repeats from erroneous sequences. EquiRep estimates the likelihood of positions originating from the same location in the unit through self-alignment, followed by a novel refinement approach. The resulting equivalence classes and consecutive position information are then used to build a weighted graph. A cycle in this graph with maximum bottleneck weight covering most nucleotide positions is identified to reconstruct the repeat unit. We test EquiRep on two applications, identifying repeat units from satellite DNAs and reconstructing circular RNAs from rolling-circular long-read sequencing data, using both simulated and raw sequencing datasets. Our results show that EquiRep consistently outperforms or matches state-of-the-art methods, demonstrating robustness to sequencing errors and superior performance on long repeat units and low-frequency repeats. 
These capabilities underscore EquiRep’s broad utility in tandem repeat analysis.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"8 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2025-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144898433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ghanshyam Chandra, Md Helal Hossen, Stephan Scholz, Alexander T Dilthey, Daniel Gibney, Chirag Jain
Affordable genotyping methods are essential in genomics. Commonly used genotyping methods primarily support single nucleotide variants and short indels but neglect structural variants. Additionally, accuracy of read alignments to a reference genome is unreliable in highly polymorphic and repetitive regions, further impacting genotyping performance. Recent works highlight the advantage of haplotype-resolved pangenome graphs in addressing these challenges. Building on these developments, we propose a rigorous alignment-free genotyping method. Our optimization framework identifies a path through the pangenome graph that maximizes the matches between the path and substrings of sequencing reads (e.g., k-mers) while minimizing recombination events (haplotype switches) along the path. We prove that this problem is NP-Hard and develop efficient integer-programming solutions. We benchmarked the algorithm using downsampled short-read datasets from homozygous human cell lines with coverage ranging from 0.1× to 10×. Our algorithm accurately estimates complete major histocompatibility complex (MHC) haplotype sequences with small edit distances from the ground-truth sequences, providing a significant advantage over existing methods on low-coverage inputs. While this algorithm is designed for haploid genomes, we discuss directions for extending it to diploid genotyping.
{"title":"Pangenome-based genome inference using integer programming","authors":"Ghanshyam Chandra, Md Helal Hossen, Stephan Scholz, Alexander T Dilthey, Daniel Gibney, Chirag Jain","doi":"10.1101/gr.280567.125","DOIUrl":"https://doi.org/10.1101/gr.280567.125","url":null,"abstract":"Affordable genotyping methods are essential in genomics. Commonly used genotyping methods primarily support single nucleotide variants and short indels but neglect structural variants. Additionally, accuracy of read alignments to a reference genome is unreliable in highly polymorphic and repetitive regions, further impacting genotyping performance. Recent works highlight the advantage of haplotype-resolved pangenome graphs in addressing these challenges. Building on these developments, we propose a rigorous alignment-free genotyping method. Our optimization framework identifies a path through the pangenome graph that maximizes the matches between the path and substrings of sequencing reads (e.g., <em>k</em>-mers) while minimizing recombination events (haplotype switches) along the path. We prove that this problem is NP-Hard and develop efficient integer-programming solutions. We benchmarked the algorithm using downsampled short-read datasets from homozygous human cell lines with coverage ranging from 0.1× to 10×. Our algorithm accurately estimates complete major histocompatibility complex (MHC) haplotype sequences with small edit distances from the ground-truth sequences, providing a significant advantage over existing methods on low-coverage inputs. 
While this algorithm is designed for haploid genomes, we discuss directions for extending it to diploid genotyping.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"50 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2025-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144898435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
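The trade-off described in the abstract above (reward matches to the reads, penalize haplotype switches) can be illustrated with a deliberately simplified toy. The sketch below is not the authors' integer program and does not address the NP-hard problem they prove: it assumes the graph has already been cut into consecutive blocks and that each haplotype has an independent, precomputed match score per block, which reduces the problem to a Viterbi-style dynamic program. The function name `best_haplotype_path`, the `scores` layout, and `switch_penalty` are all hypothetical names chosen for illustration.

```python
def best_haplotype_path(scores, switch_penalty):
    """Toy recombination-penalized path selection.

    scores[h][b] is an assumed, precomputed match score (e.g., number of
    read k-mers matched) for haplotype h in block b. Switching haplotypes
    between consecutive blocks costs switch_penalty (assumed positive).
    Returns (total_score, list of chosen haplotype index per block).
    """
    n_hap = len(scores)
    n_blocks = len(scores[0])

    # best[h] = best total score of any path ending on haplotype h
    best = [scores[h][0] for h in range(n_hap)]
    back = []  # back[b-1][h] = haplotype used in block b-1 on that best path

    for b in range(1, n_blocks):
        # The best predecessor to switch from is the overall best one.
        top = max(range(n_hap), key=lambda h: best[h])
        new_best, pointers = [], []
        for h in range(n_hap):
            stay = best[h]
            switch = best[top] - switch_penalty
            if stay >= switch:
                new_best.append(stay + scores[h][b])
                pointers.append(h)
            else:
                new_best.append(switch + scores[h][b])
                pointers.append(top)
        best = new_best
        back.append(pointers)

    # Trace back from the best final haplotype.
    end = max(range(n_hap), key=lambda h: best[h])
    path = [end]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[end], path


if __name__ == "__main__":
    toy_scores = [
        [5, 5, 0, 0],  # haplotype 0 matches the reads in the first two blocks
        [0, 0, 5, 5],  # haplotype 1 matches the reads in the last two blocks
        [1, 1, 1, 1],  # haplotype 2 matches weakly everywhere
    ]
    print(best_haplotype_path(toy_scores, switch_penalty=2))
```

With a small penalty the toy path switches once, from haplotype 0 to haplotype 1; with a very large penalty it stays on a single haplotype. The real formulation is harder because matches are counted over shared k-mers along an arbitrary graph path rather than over independent per-block scores.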
The genus Tuber (family: Tuberaceae) includes the most economically valuable ectomycorrhizal (ECM), truffle-forming fungi. Previous genomic analyses revealed that massive transposable element (TE) proliferation represents a convergent genomic feature of ECM fungi, including Tuberaceae. Repetitive sequences constitute a principal driver of genome evolution shaping its architecture and regulatory networks. In this context, Tuberaceae can become an important model system to study their genomic impact; however, the family lacks high-quality assemblies. Here, we investigate the interplay between TEs and Tuberaceae genome evolution by producing a highly contiguous assembly for the endangered Chinese white truffle Tuber panzhihuanense, along with a recalibrated timeline for Tuberaceae diversification and comprehensive comparative genomic analyses. We find that, concurrently with a Paleogene diversification of the family, pre-existing Chromoviridae-related Gypsy clades independently expanded in different truffle lineages, leading to increased genome size and high gene family turnover rates, but without resulting in highly rearranged genomes. Additionally, we uncover a significant enrichment of ECM-induced gene families stemming from ancestral duplication events. Finally, we explore the repetitive structure of nuclear ribosomal DNA (rDNA) loci for the first time in the clade. Most of the 45S rDNA paralogues are undergoing concerted evolution, though an isolated divergent locus raises concerns about potential issues for metabarcoding and biodiversity assessments. Our study establishes a fundamental genomic resource for future research on truffle genomics and showcases a clear example of how establishment and self-perpetuating expansion of heterochromatin can drive massive genome size variation due to activity of selfish genetic elements.
{"title":"High-quality assembly of the Chinese white truffle genome and recalibrated divergence time estimate provide insight into the evolutionary dynamics of Tuberaceae","authors":"Jacopo Martelossi, Jacopo Vujovic, Yue Huang, Alessia Tatti, Kaiwei Xu, Federico Puliga, Yuanxue Chen, Omar Rota Stabelli, Fabrizio Ghiselli, Xiaoping Zhang, Alessandra Zambonelli","doi":"10.1101/gr.280368.124","DOIUrl":"https://doi.org/10.1101/gr.280368.124","url":null,"abstract":"The genus <em>Tuber</em> (family: Tuberaceae) includes the most economically valuable ectomycorrhizal (ECM), truffle-forming fungi. Previous genomic analyses revealed that massive transposable element (TE) proliferation represents a convergent genomic feature of ECM fungi, including Tuberaceae. Repetitive sequences constitute a principal driver of genome evolution shaping its architecture and regulatory networks. In this context, Tuberaceae can become an important model system to study their genomic impact; however, the family lacks high-quality assemblies. Here, we investigate the interplay between TEs and Tuberaceae genome evolution by producing a highly contiguous assembly for the endangered Chinese white truffle <em>Tuber panzhihuanense</em>, along with a recalibrated timeline for Tuberaceae diversification and comprehensive comparative genomic analyses. We find that, concurrently with a Paleogene diversification of the family, pre-existing Chromoviridae-related Gypsy clades independently expanded in different truffle lineages, leading to increased genome size and high gene family turnover rates, but without resulting in highly rearranged genomes. Additionally, we uncover a significant enrichment of ECM-induced gene families stemming from ancestral duplication events. Finally, we explore the repetitive structure of nuclear ribosomal DNA (rDNA) loci for the first time in the clade. 
Most of the 45S rDNA paralogues are undergoing concerted evolution, though an isolated divergent locus raises concerns about potential issues for metabarcoding and biodiversity assessments. Our study establishes a fundamental genomic resource for future research on truffle genomics and showcases a clear example of how establishment and self-perpetuating expansion of heterochromatin can drive massive genome size variation due to activity of selfish genetic elements.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"9 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2025-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144898439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mari B Gornitzka, Egil Røsjø, Uddalok Jana, Easton E Ford, Alan Tourancheau, William Lees, Zachary Vanwinkle, Melissa L Smith, Corey T Watson, Andreas Lossius
Genetic diversity within the human immunoglobulin heavy chain (IGH) locus influences the expressed antibody repertoire and susceptibility to infectious and autoimmune diseases. However, repetitive sequences and complex structural variation pose significant challenges for large-scale characterization. Here, we introduce a method that combines Oxford Nanopore Technologies ultra-long sequencing and adaptive sampling with a bioinformatic pipeline to produce haplotype-resolved, annotated IGH assemblies. Notably, our strategy overcomes prior limitations in phasing resolution, enabling single-contig haplotype assemblies that span the entire IGH locus. We apply this method to four individuals and validate the accuracy of the IGH assemblies using Pacific Biosciences HiFi reads, demonstrating near-complete sequence congruence, with only some residual indel errors. Moreover, when applying our pipeline to the reference material HG002, it reveals no base differences and a limited number of indels compared with the Telomere-to-Telomere genome benchmark across the IGH region. Importantly, in the four individuals, our approach uncovers 28 novel alleles and previously uncharacterized large structural variants, including a 120 kb duplication spanning IGHE to IGHA1 within the IGH constant region (IGHC) and, within the IGHV region, an expanded seven-copy IGHV3-23 gene haplotype. These findings underscore the power of our method to resolve the full complexity of the IGH locus and uncover previously unrecognized variants that may affect immune function and disease susceptibility. Thus, our method provides a strong basis for future immunological research and translational applications.
{"title":"Ultra-long sequencing for contiguous haplotype resolution of the human immunoglobulin heavy chain locus","authors":"Mari B Gornitzka, Egil Røsjø, Uddalok Jana, Easton E Ford, Alan Tourancheau, William Lees, Zachary Vanwinkle, Melissa L Smith, Corey T Watson, Andreas Lossius","doi":"10.1101/gr.280400.125","DOIUrl":"https://doi.org/10.1101/gr.280400.125","url":null,"abstract":"Genetic diversity within the human immunoglobulin heavy chain (IGH) locus influences the expressed antibody repertoire and susceptibility to infectious and autoimmune diseases. However, repetitive sequences and complex structural variation pose significant challenges for large-scale characterization. Here, we introduce a method that combines Oxford Nanopore Technologies ultra-long sequencing and adaptive sampling with a bioinformatic pipeline to produce haplotype-resolved, annotated IGH assemblies. Notably, our strategy overcomes prior limitations in phasing resolution, enabling single-contig haplotype assemblies that span the entire IGH locus. We apply this method to four individuals and validate the accuracy of the IGH assemblies using Pacific Biosciences HiFi reads, demonstrating near-complete sequence congruence, with only some residual indel errors. Moreover, when applying our pipeline to the reference material HG002, it reveals no base differences and a limited number of indels compared with the Telomere-to-Telomere genome benchmark across the IGH region. Importantly, in the four individuals, our approach uncovers 28 novel alleles and previously uncharacterized large structural variants, including a 120 kb duplication spanning IGHE to IGHA1 within the IGH constant region (IGHC) and, within the IGHV region, an expanded seven-copy IGHV3-23 gene haplotype. These findings underscore the power of our method to resolve the full complexity of the IGH locus and uncover previously unrecognized variants that may affect immune function and disease susceptibility. 
Thus, our method provides a strong basis for future immunological research and translational applications.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"8 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2025-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144898391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alex Warwick Vesztrocy, Natasha Glover, Paul D. Thomas, Christophe Dessimoz, Irene Julca
Gene duplication is a major evolutionary source of functional innovation. Following duplication events, gene copies (paralogues) may undergo various fates, including retention with functional modifications (such as subfunctionalization or neofunctionalization) or loss. When paralogues are retained, this results in complex orthology relationships, including one-to-many or many-to-many. In such cases, determining which one-to-one pair is more likely to have conserved functions can be challenging. It has been proposed that, following gene duplication, the copy that diverges more slowly in sequence is more likely to maintain the ancestral function, referred to here as "the least diverged orthologue (LDO) conjecture". This study explores this conjecture, using a novel method to identify asymmetric evolution of paralogues and applying it to all gene families across the Tree of Life in the PANTHER database. Structural data for over 1 million proteins and expression data for 16 animals and 20 plants were then used to investigate functional divergence following duplication. This analysis, the most comprehensive to date, revealed that whilst the majority of paralogues display similar rates of sequence evolution, significant differences in branch lengths following gene duplication can be correlated with functional divergence. Overall, the results support the least diverged orthologue conjecture, suggesting that the least diverged orthologue (LDO) tends to retain the ancestral function, whilst the most diverged orthologue (MDO) may acquire a new, potentially specialized, role.
{"title":"Unveiling the functional fate of duplicated genes through expression profiling and structural analysis","authors":"Alex Warwick Vesztrocy, Natasha Glover, Paul D. Thomas, Christophe Dessimoz, Irene Julca","doi":"10.1101/gr.280166.124","DOIUrl":"https://doi.org/10.1101/gr.280166.124","url":null,"abstract":"Gene duplication is a major evolutionary source of functional innovation. Following duplication events, gene copies (paralogues) may undergo various fates, including retention with functional modifications (such as subfunctionalization or neofunctionalization) or loss. When paralogues are retained, this results in complex orthology relationships, including one-to-many or many-to-many. In such cases, determining which one-to-one pair is more likely to have conserved functions can be challenging. It has been proposed that, following gene duplication, the copy that diverges more slowly in sequence is more likely to maintain the ancestral function -referred to here as \"the least diverged orthologue (LDO) conjecture\". This study explores this conjecture, using a novel method to identify asymmetric evolution of paralogues and apply it to all gene families across the Tree of Life in the PANTHER database. Structural data for over 1 million proteins and expression data for 16 animals and 20 plants were then used to investigate functional divergence following duplication. This analysis, the most comprehensive to date, revealed that whilst the majority of paralogues display similar rates of sequence evolution, significant differences in branch lengths following gene duplication can be correlated with functional divergence. 
Overall, the results support the least diverged orthologue conjecture, suggesting that the least diverged orthologue (LDO) tends to retain the ancestral function, whilst the most diverged orthologue (MDO) may acquire a new, potentially specialized, role.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"749 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2025-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144850649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ran Zhang, Chengxiang Qiu, Galina Filippova, Gang Li, Jay Shendure, Jean-Philippe Vert, Xinxian Deng, William Stafford Noble, Christine Disteche
The emergence of single-cell time-series datasets enables modeling of changes in various types of cellular profiles over time. However, due to the disruptive nature of single-cell measurements, it is impossible to capture the full temporal trajectory of a particular cell. Furthermore, single-cell profiles can be collected at mismatched time points across different conditions (e.g., sex, batch, disease) and data modalities (e.g., scRNA-seq, scATAC-seq), which makes modeling challenging. Here we propose a joint modeling framework, Sunbear, for integrating multicondition and multimodal single-cell profiles across time. Sunbear can be used to impute single-cell temporal profile changes, align multidataset and multimodal profiles across time, and extrapolate single-cell profiles in a missing modality. We applied Sunbear to reveal sex-biased transcription during mouse embryonic development and predict dynamic relationships between epigenetic priming and transcription for cells in which multimodal profiles are unavailable. Sunbear thus enables the projection of single-cell time-series snapshots to multimodal and multicondition views of cellular trajectories.
{"title":"Multicondition and multimodal temporal profile inference during mouse embryonic development","authors":"Ran Zhang, Chengxiang Qiu, Galina Filippova, Gang Li, Jay Shendure, Jean-Philippe Vert, Xinxian Deng, William Stafford Noble, Christine Disteche","doi":"10.1101/gr.279997.124","DOIUrl":"https://doi.org/10.1101/gr.279997.124","url":null,"abstract":"The emergence of single-cell time-series datasets enables modeling of changes in various types of cellular profiles over time. However, due to the disruptive nature of single-cell measurements, it is impossible to capture the full temporal trajectory of a particular cell. Furthermore, single-cell profiles can be collected at mismatched time points across different conditions (e.g., sex, batch, disease) and data modalities (e.g., scRNA-seq, scATAC-seq), which makes modeling challenging. Here we propose a joint modeling framework, Sunbear, for integrating multicondition and multimodal single-cell profiles across time. Sunbear can be used to impute single-cell temporal profile changes, align multidataset and multimodal profiles across time, and extrapolate single-cell profiles in a missing modality. We applied Sunbear to reveal sex-biased transcription during mouse embryonic development and predict dynamic relationships between epigenetic priming and transcription for cells in which multimodal profiles are unavailable. 
Sunbear thus enables the projection of single-cell time-series snapshots to multimodal and multicondition views of cellular trajectories.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"2 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2025-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144850667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Han Shu, Jing Chen, Chang Xu, Jialu Hu, Yongtian Wang, Jiajie Peng, Qinghua Jiang, Xuequn Shang, Tao Wang
Spatial omics (SO) is a powerful methodology that enables the study of genes, proteins, and other molecular features within the spatial context of tissue architecture. With the growing availability of SO datasets, researchers are eager to extract biological insights from larger datasets for a more comprehensive understanding. However, existing approaches focus on batch effect correction, often neglecting complex biological patterns in tissue slices, complicating feature integration and posing challenges when combining transcriptomics with other omics layers. Here, we introduce stMSA (SpaTial Multi-Slice/omics Analysis), a deep graph contrastive learning model that incorporates graph auto-encoder techniques. stMSA is specifically designed to produce batch-corrected representations while retaining the distinct spatial patterns within each slice, considering both intra- and interbatch relationships during integration. Extensive evaluations show that stMSA outperforms state-of-the-art methods in distinguishing tissue structures across diverse slices, even when faced with varying experimental protocols and sequencing technologies. Furthermore, stMSA effectively deciphers complex developmental trajectories by integrating spatial proteomics and transcriptomics data and excels in cross-slice matching and alignment for 3D tissue reconstruction.
Efficient integration of spatial omics data for joint domain detection, matching, and alignment with stMSA. Genome Research, published 2025-08-14. https://doi.org/10.1101/gr.280584.125
Can Luo, Zimeng Jamie Zhou, Yichen Henry Liu, Xin Maizie Zhou
Structural variants (SVs) play a critical role in shaping the diversity of the human genome and their detection holds significant potential for advancing precision medicine. Despite notable progress in single-molecule long-read sequencing technologies, accurately identifying SV breakpoints and resolving their sequence remains a major challenge. Current alignment-based tools often struggle with precise breakpoint detection and sequence characterization, while whole genome assembly-based methods are computationally demanding and less practical for targeted analyses. Neither approach is ideally suited for scenarios where regions of interest are predefined and require precise SV characterization. To address this gap, we introduce FocalSV, a targeted SV detection framework that integrates both assembly- and alignment-based signals. By combining the precision of local assemblies with the efficiency of region-specific analysis, FocalSV enables more accurate SV detection. FocalSV supports user-defined target regions and can automatically identify and expand regions with potential structural variants to enable more comprehensive detection. FocalSV was evaluated on ten germline datasets and two paired normal-tumor cancer datasets, demonstrating superior performance in both precision and efficiency.
FocalSV enables target region-based structural variant assembly and refinement using single-molecule long-read sequencing data. Genome Research, published 2025-08-13. https://doi.org/10.1101/gr.280282.124