Wenhao Zhang, Lan Cao, Xiaoxuan Gu, Yongyu Long, Ying Wang
Understanding Gene Regulatory Networks (GRNs) is crucial for deciphering cellular heterogeneity and the mechanisms underlying development and disease. However, current GRN inference methods fail to use multiomics data and prior knowledge in a biologically interpretable way. Therefore, we propose PRISM-GRN, a Bayesian model that incorporates known GRNs, along with scRNA-seq and scATAC-seq data, into a probabilistic framework to reconstruct cell type-specific GRNs. PRISM-GRN employs a biologically interpretable architecture rooted in the established gene regulatory mechanism, which asserts that gene expression is influenced by TF expression levels and gene chromatin accessibility through GRNs. Accordingly, PRISM-GRN decomposes observable data into biologically meaningful latent variables through a mechanism-informed generation process and a prior-GRN-primed inference process, enabling precise and robust GRN reconstruction. We evaluate PRISM-GRN on four benchmarking datasets with paired scRNA-seq and scATAC-seq data, demonstrating its superior performance over seven baseline methods in GRN reconstruction, especially its higher precision under the inherently imbalanced scenario where true regulatory interactions are sparse. Furthermore, benchmarking on directed GRNs highlights PRISM-GRN's ability to capture causality in gene regulation, which derives from the biologically interpretable architecture. More importantly, PRISM-GRN performs well with unpaired omics data and limited prior GRN information, showcasing its flexibility and adaptability across various biological contexts. Finally, biological analyses on PBMC datasets demonstrate PRISM-GRN's potential to facilitate the identification of cell type-specific or context-specific GRNs in real-world biological research applications. Overall, PRISM-GRN provides a novel paradigm for precise, robust, and interpretable exploration of causal GRNs with prior knowledge and multiomics data.
Title: Recovering gene regulatory networks in single-cell multiomics data with PRISM-GRN. Journal: Genome Research. Published: 2025-10-09. DOI: https://doi.org/10.1101/gr.280757.125
Roberto Rossini, Saleh Oshaghi, Maxim Nekrasov, Aurélie Bellanger, Renae Domaschenz, Yasmin Dijkwel, Mohamed Abdelhalim, Rahul Agrawal, Marit Ledsaak, Philippe Collas, Ragnhild Eskeland, David Tremethick, Jonas Paulsen
Breast cancer entails intricate alterations in genome organization and expression. However, how three-dimensional (3D) chromatin structure changes in the progression from a normal to a breast cancer malignant state remains unknown. To address this, we conducted an analysis combining Hi-C data with lamina-associated domains (LADs), epigenomic marks, and gene expression in an in vitro model of breast cancer progression. Our results reveal that while the fundamental properties of topologically associating domains (TADs) are overall maintained, significant changes occur in the organization of compartments and subcompartments. These changes are closely correlated with alterations in the expression of oncogenic genes. We also observe a restructuring of TAD-TAD interactions, coinciding with a loss of spatial compartmentalization and radial positioning of the 3D genome. Notably, we identify a previously unrecognized interchromosomal insertion event, wherein a locus on Chromosome 8 housing the MYC oncogene is inserted into a highly active subcompartment on Chromosome 10. This insertion is accompanied by the formation of de novo enhancer contacts and activation of MYC, illustrating how structural genomic variants can alter the 3D genome during oncogenesis. In summary, our findings provide evidence for the loss of genome organization at multiple scales during breast cancer progression revealing novel relationships between genome 3D structure and oncogenic processes.
Title: Loss of multilevel 3D genome organization during breast cancer progression. Journal: Genome Research. Published: 2025-10-09. DOI: https://doi.org/10.1101/gr.280791.125
Marjan Farahbod, Aboud Diab, Paul Sud, Meenakshi S. Kagda, Ian Whaling, Mehdi Foroozandeh, Ishan Goel, Habib Daneshpajouh, Benjamin Hitz, J. Michael Cherry, Maxwell W. Libbrecht
The fourth and final phase of the ENCODE consortium has newly profiled epigenetic activity in hundreds of human tissues. Chromatin state annotations created by segmentation and genome annotation (SAGA) methods such as Segway have emerged as the predominant integrative summary of such data sets. Here, we present the ENCODE4 Catalog of Segway Annotations, a set of sample-specific genome-wide chromatin state annotations of 234 human biosamples inferred from 1,794 genomics experiments. This catalog identifies genomic elements, accurately captures cell type-specific regulatory patterns, and facilitates discovery of elements involved in phenotype and disease.
Title: Integrative chromatin state annotation of 234 human ENCODE4 cell types using Segway. Journal: Genome Research. Published: 2025-10-06. DOI: https://doi.org/10.1101/gr.280633.125
Antonio Bernardo Carvalho, Bernard Y Kim, Fabiana Uno
Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) are generally considered free from sequence composition bias, a key factor, alongside read length, that explains their success in producing high-quality genome assemblies. Indeed, there have been very few reports of bias, the clearest one against GA-rich repeats in the human genome. However, our study reveals a systematic failure of both technologies to sequence and assemble specific exons of Drosophila melanogaster genes, indicating an overlooked limitation. Namely, multiple Y-linked exons are nearly or completely absent from raw reads produced by deep sequencing with state-of-the-art ONT (10.4 flow cells, 200× coverage) and PacBio (HiFi 50×). The same exons are accurately assembled using Illumina 67× coverage. We found that these missing exons are consistently located near simple satellite sequences, where sequencing fails at multiple levels: read initiation (very few reads start within satellite regions), read elongation (satellite-containing reads are shorter on average), and base-calling (quality scores drop as sequencing enters a satellite sequence). These findings challenge the assumption that long-read technologies are unbiased and reveal a critical barrier to assembling sequences near repetitive regions. As large-scale sequencing projects move towards telomere-to-telomere assemblies in a wide range of organisms, recognizing and addressing these biases will be important to achieving truly complete and accurate genomes. Additionally, the underrepresented Y-linked exons provide a valuable benchmark for refining those sequencing technologies while improving the assembly of the highly heterochromatic and often neglected Drosophila Y Chromosome.
Title: Strong bias in long-read sequencing prevents assembly of Drosophila melanogaster Y-linked genes. Journal: Genome Research. Published: 2025-10-01. DOI: https://doi.org/10.1101/gr.280604.125
Cell type annotation is a critical and essential task in single-cell data analysis. Various reference-based methods have provided rapid annotation for diverse single-cell data. However, how to select the optimal references and methods is often overlooked. To this end, we present a cross-dataset cell-type annotation methodology with a universal reference data and method selection strategy (CAMUS) to achieve highly accurate and efficient annotations. We demonstrate the advantages of CAMUS by conducting comprehensive analyses on 672 pairs of cross-species scRNA-seq datasets. The annotation results with references selected by CAMUS achieved substantial accuracy gains (25.0-124.7%) over random selection strategies across five reference-based methods. CAMUS achieved high accuracy in choosing the best reference-method pair among 3360 pairs (49.1%). Moreover, CAMUS showed high accuracy in selecting the best methods on the 80 scST datasets (82.5%) and five scATAC-seq datasets (100.0%), illustrating its universal applicability. In addition, we utilized the CAMUS score with other metrics to predict the annotation accuracy, providing direct guidance on whether to accept current annotation results.
Title: Highly accurate reference and method selection for universal cross-dataset cell type annotation with CAMUS. Authors: Qunlun Shen, Shuqin Zhang, Shihua Zhang. Journal: Genome Research. Published: 2025-10-01. DOI: https://doi.org/10.1101/gr.280821.125
Centromeres, characterized by their unique chromatin attributes, are indispensable for safeguarding genomic stability. Due to their intricate and fragile nature, centromeres are susceptible to chromosomal rearrangements. However, the mechanisms preserving their functional integrity and supporting nucleus homeostasis following breakages remained enigmatic. In this study, we use wheat ditelosomic stocks, which arise from centromere breakage, to explore the genetic and epigenetic alterations in damaged centromeres. Our investigations unveil novel chromosome end structures marked by de novo addition of telomeres, as well as localized chromosomal shattering, including segment deletions and duplications near centromere breakpoints. We reveal that the damaged centromeres possess a remarkable capacity for self-regulation, through employing structural modifications such as expansion, contraction, and neocentromere formation to maintain their functional integrity. Centromere breakage triggers nucleosome remodeling and is accompanied by local transcription changes and chromatin reorganization, and subsequently may contribute to the stabilization of broken chromosomes. Our findings highlight the resilience and adaptability of plant chromosomes in response to centromere breakage, and provide valuable insights into the stability of centromeres, thereby offering promising prospects to manipulate centromeres for targeted chromosomal innovation and crop genetic improvement.
Title: Adaptation of centromere to breakage through local genomic and epigenomic remodeling in wheat. Authors: Jingwei Zhou, Yuhong Huang, Huan Ma, Yiqian Chen, Chuanye Chen, Fangpu Han, Handong Su. Journal: Genome Research. Published: 2025-09-30. DOI: https://doi.org/10.1101/gr.280913.125
Jim Shaw, Christina Boucher, Yun William Yu, Noelle Noyes, Heng Li
Reconstructing exact haplotypes is important when sequencing a mixture of similar sequences. Long-read sequencing can connect distant alleles to disentangle similar haplotypes, but handling sequencing errors requires specialized techniques. We present devider, an algorithm for haplotyping small sequences - such as viruses or genes - from long-read sequencing. devider uses a positional de Bruijn graph with sequence-to-graph alignment on an alphabet of informative alleles to provide a fast assembly-inspired approach compatible with various long-read sequencing technologies. On a synthetic Nanopore dataset containing seven HIV strains, devider recovered 97% of the haplotype content and had the most accurate abundance estimates while taking < 4 minutes and 1 GB of memory for > 8000× coverage. Benchmarking on synthetic mixtures of antimicrobial resistance (AMR) genes showed that devider recovered 83% of haplotypes, 23 percentage points higher than the next best method. On real PacBio and Nanopore datasets, devider recapitulates previously known results in seconds, disentangling a bacterial community with > 10 strains and an HIV-1 co-infection dataset. We used devider to investigate the within-host diversity of a long-read bovine gut metagenome enriched for AMR genes, discovering 13 distinct haplotypes for a tet(Q) tetracycline resistance gene with > 18,000× coverage and 6 haplotypes for a CfxA2 beta-lactamase gene. We found clear recombination blocks for these AMR gene haplotypes, showcasing devider's ability to unveil evolutionary signals for heterogeneous mixtures.
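The idea of a positional de Bruijn graph over an alphabet of informative alleles can be illustrated with a small sketch. This is not devider's implementation: the read encoding (a start site plus a string of alleles at consecutive informative sites), the function names, and the `min_count` error filter are all assumptions made for illustration. Each node is a k-mer of alleles tagged with the site where it starts, so identical allele strings at different positions stay distinct.

```python
from collections import Counter

def positional_kmers(reads, k):
    """Count positional k-mers.

    reads: iterable of (start_site, alleles), where `alleles` is a string of
    alleles observed at consecutive informative sites (hypothetical encoding).
    Returns a Counter keyed by (site, kmer).
    """
    counts = Counter()
    for start, alleles in reads:
        for i in range(len(alleles) - k + 1):
            counts[(start + i, alleles[i:i + k])] += 1
    return counts

def build_graph(counts, min_count=2):
    """Keep k-mers seen at least `min_count` times and link overlapping ones.

    An edge joins (p, s) to (p + 1, t) when the last k-1 alleles of s equal
    the first k-1 alleles of t. Rare k-mers are treated as likely errors.
    """
    nodes = {node for node, c in counts.items() if c >= min_count}
    edges = set()
    for pos, kmer in nodes:
        for other_pos, other in nodes:
            if other_pos == pos + 1 and kmer[1:] == other[:-1]:
                edges.add(((pos, kmer), (other_pos, other)))
    return nodes, edges

# Three reads agree on one haplotype; the fourth carries a likely error at site 1.
reads = [(0, "ACGT"), (0, "ACGT"), (1, "CGT"), (0, "ATGT")]
counts = positional_kmers(reads, k=2)
nodes, edges = build_graph(counts, min_count=2)
```

In this toy input the k-mers unique to the fourth read are seen only once and are dropped, leaving a single path through the graph that spells one haplotype.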
Title: Long-read reconstruction of many diverse haplotypes with devider. Journal: Genome Research. Published: 2025-09-23. DOI: https://doi.org/10.1101/gr.280510.125
Qiang Su, Yi Long, Deming Gou, Junmin Quan, Xiaoming Zhou, Qizhou Lian
RNA sequencing (RNA-seq) is a pivotal tool for transcriptomic analysis, providing comprehensive exploration of gene expression across diverse biological contexts. However, RNA-seq data is susceptible to various biases that can significantly compromise the accuracy and reliability of transcript quantification. This study investigates the influence of high-dimensional RNA structures on local sequencing efficiency using an innovative unsupervised Variational Autoencoder-Gaussian Mixture Model (VAE-GMM). The VAE-GMM effectively captures intricate high-dimensional k-mer structural similarities by learning compact latent representations, which reduces dimensionality while meticulously preserving essential structural features crucial for bias identification. This sophisticated modeling allows precise tracking of local RNA-read conversion dynamics and the identification of complex, often overlooked, bias sources. We rigorously validate the VAE-GMM model's performance and robustness against conventional machine learning techniques, including Gaussian Mixture Models (GMM-only), Principal Component Analysis-based GMMs, k-means clustering, and Hierarchical Clustering. These validations, using an extensive and diverse array of datasets including synthetic RNA constructs, various human cell lines, and authentic tissue samples, consistently demonstrate the model's superior versatility and accuracy across different biological systems. Furthermore, in silico simulations of the sequencing process closely align with actual sequencing data, strongly reinforcing the critical role of high-dimensional RNA structures in determining sequencing efficiency and their impact on data quality. Our findings offer valuable insights into the underlying mechanisms of RNA structure-mediated sequencing bias. This deeper understanding enables more accurate and reliable RNA-seq analyses and is expected to improve the interpretation of transcriptomic data in future genomic studies.
Title: Deep structural clustering reveals hidden systematic biases in RNA sequencing data. Journal: Genome Research. Published: 2025-09-19. DOI: https://doi.org/10.1101/gr.280713.125
Differential expression (DE) analysis is a widely used method for identifying genes that are functionally relevant for an observed phenotype or biological response. However, typical DE analysis includes selection of genes based on a threshold of fold change in expression under the implicit assumption that all genes are equally sensitive to dosage changes of their transcripts. This tends to favor highly variable genes over more constrained genes where even small changes in expression may be biologically relevant. To address this limitation, we have developed a method to recalibrate each gene's DE fold change based on genetic expression variance observed in the human population. The newly established metric ranks statistically differentially expressed genes, not by nominal change of expression, but by relative change in comparison to natural dosage variation for each gene. We apply our method to RNA sequencing data sets from in vitro stimulus response and neuropsychiatric disease experiments. Compared to the standard approach, our method adjusts the bias in discovery toward highly variable genes and enriches for pathways and biological processes related to metabolic and regulatory activity, indicating a prioritization of functionally relevant driver genes. Tissue-specific recalibration increases detection of known disease-relevant processes. Altogether, our method provides a novel view on DE and contributes toward bridging the existing gap between statistical and biological significance. We believe that this approach will simplify the identification of disease-causing molecular processes and enhance the discovery of therapeutic targets.
{"title":"Recalibrating differential gene expression by genetic dosage variance prioritizes functionally relevant genes","authors":"Philipp Rentzsch, Aaron Kollotzek, Kaushik Ram Ganapathy, Pejman Mohammadi, Tuuli Lappalainen","doi":"10.1101/gr.280360.124","DOIUrl":"https://doi.org/10.1101/gr.280360.124","url":null,"PeriodicalId":12678,"journal":{"name":"Genome research","volume":"53 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2025-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145077422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
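The recalibration idea described in the abstract on differential expression can be illustrated with a minimal sketch. The abstract does not state the formula, so the scaling used here (log2 fold change divided by the standard deviation of a gene's natural dosage variation) is an assumption made for illustration only. The function name `recalibrate`, the gene names, and all numeric values are hypothetical.

```python
from typing import Dict, List, Tuple


def recalibrate(
    log2_fc: Dict[str, float],
    dosage_sd: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Rank genes by fold change relative to natural dosage variation.

    Assumption: the recalibrated score is the log2 fold change divided by
    the per-gene standard deviation of genetically driven expression
    variation. The published method may use a different formula.
    """
    scores = {}
    for gene, fc in log2_fc.items():
        sd = dosage_sd.get(gene)
        if sd is None or sd <= 0:
            continue  # skip genes without a usable variance estimate
        scores[gene] = fc / sd
    # Largest absolute recalibrated change first.
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Hypothetical values: GENE_A is highly variable, GENE_B is constrained.
log2_fc = {"GENE_A": 2.0, "GENE_B": 0.5, "GENE_C": -1.0}
dosage_sd = {"GENE_A": 1.0, "GENE_B": 0.1, "GENE_C": 1.0}

ranked = recalibrate(log2_fc, dosage_sd)
for gene, score in ranked:
    print(f"{gene}\t{score:+.2f}")
```

Ranked by nominal fold change, GENE_A would come first. After rescaling, the constrained GENE_B moves to the top, which is the shift away from highly variable genes that the abstract describes.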