Antonio Bernardo Carvalho, Bernard Y Kim, Fabiana Uno
Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) are generally considered free from sequence composition bias, a key factor - alongside read length - that explains their success in producing high quality genome assemblies. Indeed, there had been very few reports of bias, the clearest one against GA-rich repeats in the human genome. However, our study reveals a systematic failure of both technologies to sequence and assemble specific exons of Drosophila melanogaster genes, indicating an overlooked limitation. Namely, multiple Y-linked exons are nearly or completely absent from raw reads produced by deep sequencing with state-of-the-art ONT (10.4 flow cells, 200× coverage) and PacBio (HiFi 50×). The same exons are accurately assembled using Illumina 67× coverage. We found that these missing exons are consistently located near simple satellite sequences, where sequencing fails at multiple levels: read initiation (very few reads start within satellite regions), read elongation (satellite-containing reads are shorter on average), and base-calling (quality scores drop as sequencing enters a satellite sequence). These findings challenge the assumption that long-read technologies are unbiased and reveal a critical barrier to assembling sequences near repetitive regions. As large-scale sequencing projects move towards telomere-to-telomere assemblies in a wide range of organisms, recognizing and addressing these biases will be important to achieving truly complete and accurate genomes. Additionally, the underrepresented Y-linked exons provide a valuable benchmark for refining those sequencing technologies while improving the assembly of the highly heterochromatic and often neglected Drosophila Y Chromosome.
{"title":"Strong bias in long-read sequencing prevents assembly of Drosophila melanogaster Y-linked genes","authors":"Antonio Bernardo Carvalho, Bernard Y Kim, Fabiana Uno","doi":"10.1101/gr.280604.125","DOIUrl":"https://doi.org/10.1101/gr.280604.125","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"101 1"},"PeriodicalIF":7.0,"publicationDate":"2025-10-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"145203171","RegionNum":2,"RegionCategory":"Biology"}
Cell type annotation is a critical and essential task in single-cell data analysis. Various reference-based methods have provided rapid annotation for diverse single-cell data. However, how to select the optimal references and methods is often overlooked. To this end, we present a cross-dataset cell-type annotation methodology with a universal reference data and method selection strategy (CAMUS) to achieve highly accurate and efficient annotations. We demonstrate the advantages of CAMUS by conducting comprehensive analyses on 672 pairs of cross-species scRNA-seq datasets. The annotation results with references selected by CAMUS achieved substantial accuracy gains (25.0-124.7%) over random selection strategies across five reference-based methods. CAMUS achieved high accuracy in choosing the best reference-method pair among 3360 pairs (49.1%). Moreover, CAMUS showed high accuracy in selecting the best methods on the 80 scST datasets (82.5%) and five scATAC-seq datasets (100.0%), illustrating its universal applicability. In addition, we utilized the CAMUS score with other metrics to predict the annotation accuracy, providing direct guidance on whether to accept current annotation results.
{"title":"Highly accurate reference and method selection for universal cross-dataset cell type annotation with CAMUS","authors":"Qunlun Shen, Shuqin Zhang, Shihua Zhang","doi":"10.1101/gr.280821.125","DOIUrl":"https://doi.org/10.1101/gr.280821.125","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"95 1"},"PeriodicalIF":7.0,"publicationDate":"2025-10-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"145203170","RegionNum":2,"RegionCategory":"Biology"}
Centromeres, characterized by their unique chromatin attributes, are indispensable for safeguarding genomic stability. Due to their intricate and fragile nature, centromeres are susceptible to chromosomal rearrangements. However, the mechanisms preserving their functional integrity and supporting nucleus homeostasis following breakages remained enigmatic. In this study, we use wheat ditelosomic stocks, which arise from centromere breakage, to explore the genetic and epigenetic alterations in damaged centromeres. Our investigations unveil novel chromosome end structures marked by de novo addition of telomeres, as well as localized chromosomal shattering, including segment deletions and duplications near centromere breakpoints. We reveal that the damaged centromeres possess a remarkable capacity for self-regulation, through employing structural modifications such as expansion, contraction, and neocentromere formation to maintain their functional integrity. Centromere breakage triggers nucleosome remodeling and is accompanied by local transcription changes and chromatin reorganization, and subsequently may contribute to the stabilization of broken chromosomes. Our findings highlight the resilience and adaptability of plant chromosomes in response to centromere breakage, and provide valuable insights into the stability of centromeres, thereby offering promising prospects to manipulate centromeres for targeted chromosomal innovation and crop genetic improvement.
{"title":"Adaptation of centromere to breakage through local genomic and epigenomic remodeling in wheat","authors":"Jingwei Zhou, Yuhong Huang, Huan Ma, Yiqian Chen, Chuanye Chen, Fangpu Han, Handong Su","doi":"10.1101/gr.280913.125","DOIUrl":"https://doi.org/10.1101/gr.280913.125","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"29 1"},"PeriodicalIF":7.0,"publicationDate":"2025-09-30","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"145195152","RegionNum":2,"RegionCategory":"Biology"}
Jim Shaw, Christina Boucher, Yun William Yu, Noelle Noyes, Heng Li
Reconstructing exact haplotypes is important when sequencing a mixture of similar sequences. Long-read sequencing can connect distant alleles to disentangle similar haplotypes, but handling sequencing errors requires specialized techniques. We present devider, an algorithm for haplotyping small sequences - such as viruses or genes - from long-read sequencing. devider uses a positional de Bruijn graph with sequence-to-graph alignment on an alphabet of informative alleles to provide a fast assembly-inspired approach compatible with various long-read sequencing technologies. On a synthetic Nanopore dataset containing seven HIV strains, devider recovered 97% of the haplotype content and had the most accurate abundance estimates while taking < 4 minutes and 1 GB of memory for > 8000× coverage. Benchmarking on synthetic mixtures of antimicrobial resistance (AMR) genes showed that devider recovered 83% of haplotypes, 23 percentage points higher than the next best method. On real PacBio and Nanopore datasets, devider recapitulates previously known results in seconds, disentangling a bacterial community with > 10 strains and an HIV-1 co-infection dataset. We used devider to investigate the within-host diversity of a long-read bovine gut metagenome enriched for AMR genes, discovering 13 distinct haplotypes for a tet(Q) tetracycline resistance gene with > 18,000× coverage and 6 haplotypes for a CfxA2 beta-lactamase gene. We found clear recombination blocks for these AMR gene haplotypes, showcasing devider's ability to unveil evolutionary signals for heterogeneous mixtures.
{"title":"Long-read reconstruction of many diverse haplotypes with devider","authors":"Jim Shaw, Christina Boucher, Yun William Yu, Noelle Noyes, Heng Li","doi":"10.1101/gr.280510.125","DOIUrl":"https://doi.org/10.1101/gr.280510.125","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"28 1"},"PeriodicalIF":7.0,"publicationDate":"2025-09-23","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"145127786","RegionNum":2,"RegionCategory":"Biology"}
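The devider abstract above mentions a positional de Bruijn graph built on an alphabet of informative alleles. The sketch below illustrates that general idea only; it is not devider's implementation. It assumes each read has already been reduced to a start index and a string of alleles at consecutive informative sites, and the function names `positional_kmers` and `build_positional_dbg` are hypothetical. Sequence-to-graph alignment, error handling, and abundance estimation are left out.

```python
from collections import defaultdict


def positional_kmers(start, alleles, k):
    """Yield (position, k-mer) pairs for one read.

    `start` is the index of the first informative site the read covers and
    `alleles` is the string of alleles it carries at consecutive sites.
    (Hypothetical input encoding, chosen for illustration.)
    """
    for i in range(len(alleles) - k + 1):
        yield (start + i, alleles[i:i + k])


def build_positional_dbg(reads, k):
    """Return node counts and sorted successor lists of a positional de Bruijn graph."""
    node_counts = defaultdict(int)
    edges = defaultdict(set)
    for start, alleles in reads:
        nodes = list(positional_kmers(start, alleles, k))
        for node in nodes:
            node_counts[node] += 1
        # Consecutive positional k-mers within a read are joined by an edge.
        for a, b in zip(nodes, nodes[1:]):
            edges[a].add(b)
    return dict(node_counts), {n: sorted(s) for n, s in edges.items()}


# Toy input: reads drawn from two haplotypes, "0101" and "0011",
# over four informative sites (0 = reference allele, 1 = alternate allele).
reads = [(0, "0101"), (0, "010"), (1, "101"), (0, "0011"), (1, "011")]
counts, edges = build_positional_dbg(reads, k=2)
```

Because each node carries its position, the k-mer "01" at site 0 and the same k-mer at site 1 are separate nodes. In this toy input that keeps the two haplotypes on separate paths even though they share k-mers.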
Qiang Su, Yi Long, Deming Gou, Junmin Quan, Xiaoming Zhou, Qizhou Lian
RNA sequencing (RNA-seq) is a pivotal tool for transcriptomic analysis, providing comprehensive exploration of gene expression across diverse biological contexts. However, RNA-seq data is susceptible to various biases that can significantly compromise the accuracy and reliability of transcript quantification. This study investigates the influence of high-dimensional RNA structures on local sequencing efficiency using an innovative unsupervised Variational Autoencoder-Gaussian Mixture Model (VAE-GMM). The VAE-GMM effectively captures intricate high-dimensional k-mer structural similarities by learning compact latent representations, which reduces dimensionality while meticulously preserving essential structural features crucial for bias identification. This sophisticated modeling allows precise tracking of local RNA-read conversion dynamics and the identification of complex, often overlooked, bias sources. We rigorously validate the VAE-GMM model's performance and robustness against conventional machine learning techniques, including Gaussian Mixture Models (GMM-only), Principal Component Analysis-based GMMs, k-means clustering, and Hierarchical Clustering. These validations, using an extensive and diverse array of datasets including synthetic RNA constructs, various human cell lines, and authentic tissue samples, consistently demonstrate the model's superior versatility and accuracy across different biological systems. Furthermore, in silico simulations of the sequencing process closely align with actual sequencing data, strongly reinforcing the critical role of high-dimensional RNA structures in determining sequencing efficiency and their impact on data quality. Our findings offer valuable insights into the underlying mechanisms of RNA structure-mediated sequencing bias. This deeper understanding enables more accurate and reliable RNA-seq analyses and is expected to improve the interpretation of transcriptomic data in future genomic studies.
{"title":"Deep structural clustering reveals hidden systematic biases in RNA sequencing data","authors":"Qiang Su, Yi Long, Deming Gou, Junmin Quan, Xiaoming Zhou, Qizhou Lian","doi":"10.1101/gr.280713.125","DOIUrl":"https://doi.org/10.1101/gr.280713.125","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"27 1"},"PeriodicalIF":7.0,"publicationDate":"2025-09-19","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"145089444","RegionNum":2,"RegionCategory":"Biology"}
Differential expression (DE) analysis is a widely used method for identifying genes that are functionally relevant for an observed phenotype or biological response. However, typical DE analysis includes selection of genes based on a threshold of fold change in expression under the implicit assumption that all genes are equally sensitive to dosage changes of their transcripts. This tends to favor highly variable genes over more constrained genes where even small changes in expression may be biologically relevant. To address this limitation, we have developed a method to recalibrate each gene's DE fold change based on genetic expression variance observed in the human population. The newly established metric ranks statistically differentially expressed genes, not by nominal change of expression, but by relative change in comparison to natural dosage variation for each gene. We apply our method to RNA sequencing data sets from in vitro stimulus response and neuropsychiatric disease experiments. Compared to the standard approach, our method adjusts the bias in discovery toward highly variable genes and enriches for pathways and biological processes related to metabolic and regulatory activity, indicating a prioritization of functionally relevant driver genes. Tissue-specific recalibration increases detection of known disease-relevant processes. Altogether, our method provides a novel view on DE and contributes toward bridging the existing gap between statistical and biological significance. We believe that this approach will simplify the identification of disease-causing molecular processes and enhance the discovery of therapeutic targets.
{"title":"Recalibrating differential gene expression by genetic dosage variance prioritizes functionally relevant genes","authors":"Philipp Rentzsch, Aaron Kollotzek, Kaushik Ram Ganapathy, Pejman Mohammadi, Tuuli Lappalainen","doi":"10.1101/gr.280360.124","DOIUrl":"https://doi.org/10.1101/gr.280360.124","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"53 1"},"PeriodicalIF":7.0,"publicationDate":"2025-09-17","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"145077422","RegionNum":2,"RegionCategory":"Biology"}
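The differential expression abstract above describes ranking genes by fold change relative to each gene's natural dosage variation rather than by nominal fold change. The abstract does not give the formula, so the sketch below uses an assumed one: the log2 fold change divided by the standard deviation implied by a per-gene genetic expression variance. The gene names, values, and the `recalibrate` function are hypothetical and serve only to show how the ranking can flip.

```python
import math


def recalibrate(log2_fc, genetic_variance):
    """Scale a log2 fold change by the gene's natural dosage spread.

    Assumed form for illustration: log2_fc / sqrt(genetic_variance).
    The published method may define the metric differently.
    """
    return log2_fc / math.sqrt(genetic_variance)


# Hypothetical genes: (log2 fold change, genetic expression variance).
genes = {
    "variable_gene": (2.0, 1.0),       # large change, but naturally variable
    "constrained_gene": (0.5, 0.01),   # small change, but tightly constrained
}

# Standard ranking: by absolute nominal fold change.
by_fold_change = sorted(genes, key=lambda g: abs(genes[g][0]), reverse=True)

# Recalibrated ranking: by change relative to natural dosage variation.
by_recalibrated = sorted(
    genes, key=lambda g: abs(recalibrate(*genes[g])), reverse=True
)
```

Under this assumed scaling the constrained gene moves ahead of the variable one, which is the kind of reprioritization the abstract describes.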
Liubomir Chorbadjiev, Murat Cokol, Zohar Weinstein, Kevin Shi, Christopher Fleisch, Nikolay Dimitrov, Svetlin Mladenov, Ivo Todorov, Iordan Ivanov, Simon Xu, Steven Ford, Yoon-ha Lee, Boris Yamrom, Steven Marks, Adriana Munoz, Alex Lash, Natalia Volfovsky, Ivan Iossifov
The exploration of genotypic variants impacting phenotypes is a cornerstone in genetics research. The emergence of vast collections containing deeply genotyped and phenotyped families has made it possible to pursue the search for variants associated with complex diseases. However, managing these large-scale data sets requires specialized computational tools to organize and analyze the extensive data. Genotypes and Phenotypes in Families (GPF) is an open-source platform that manages genotypes and phenotypes derived from collections of families. GPF allows interactive exploration of genetic variants, enrichment analysis for de novo mutations, phenotype/genotype association tools, and secure data sharing. GPF is used to disseminate two family collection data sets, SSC and SPARK, for the study of autism, built by the Simons Foundation. The GPF instance at the Simons Foundation (GPF-SFARI) provides protected access to comprehensive genotypic and phenotypic data for SSC and SPARK. GPF-SFARI also provides public access to an extensive collection of de novo mutations from individuals with autism and related disorders and to gene-level statistics of the protected data sets characterizing the genes’ roles in autism. However, GPF is versatile and can manage genotypic data from other small or large family collections. Here, we highlight the primary features of GPF within the context of GPF-SFARI.
{"title":"Analyzing the large and complex SFARI autism cohort data using the Genotypes and Phenotypes in Families (GPF) platform","authors":"Liubomir Chorbadjiev, Murat Cokol, Zohar Weinstein, Kevin Shi, Christopher Fleisch, Nikolay Dimitrov, Svetlin Mladenov, Ivo Todorov, Iordan Ivanov, Simon Xu, Steven Ford, Yoon-ha Lee, Boris Yamrom, Steven Marks, Adriana Munoz, Alex Lash, Natalia Volfovsky, Ivan Iossifov","doi":"10.1101/gr.280356.124","DOIUrl":"https://doi.org/10.1101/gr.280356.124","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"37 1"},"PeriodicalIF":7.0,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"145072494","RegionNum":2,"RegionCategory":"Biology"}
Yao Wang, Dong Yu, Yue Mei, Zhida Fu, Jian Lin, Di Wu, Yuan Yang, Hongli Yan
Hepatitis B virus (HBV) integration is a key driver of hepatocellular carcinoma (HCC) occurrence and progression; however, its oncogenic mechanisms remain incompletely understood because of limitations in detection methods and sample availability. In this study, we employed Oxford Nanopore Technologies (ONT) whole-genome sequencing and full-length transcriptome sequencing to characterize HBV integration events at the genomic and transcriptomic levels, along with their regulatory effects on structural variations (SVs) and gene expression. Functional validation was performed using dual-luciferase assays and cell-based experiments. Our findings revealed that integrated HBV sequences form long concatemers, mediating inter- and intrachromosomal recombination in the human genome. Notably, integrated HBV enhancer I (HBV-Enh I) was detected in 6 of 7 tumor tissues and was associated with aberrant gene expression. HBV integration induced oncogenic SVs, such as focal MYC amplification and NAV2 deletion, and directly modulated gene expression. Additionally, ectopic overexpression of MYOCD, driven by HBV-Enh I integration, promoted HCC cell migration and invasion. In summary, HBV integration acts as a major driver of large-scale genomic SVs and transcriptomic dysregulation, through either direct alterations in genome dosage or cis-regulatory mechanisms. HBV-Enh I is frequently integrated in HCC and might play a pivotal role in abnormal gene expression, highlighting its potential as a therapeutic target.
{"title":"Long-read sequencing reveals HBV integration patterns and oncogenic impact on early-onset hepatocellular carcinoma","authors":"Yao Wang, Dong Yu, Yue Mei, Zhida Fu, Jian Lin, Di Wu, Yuan Yang, Hongli Yan","doi":"10.1101/gr.279889.124","DOIUrl":"https://doi.org/10.1101/gr.279889.124","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"46 1"},"PeriodicalIF":7.0,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"145067702","RegionNum":2,"RegionCategory":"Biology"}
The T2T-CHM13 reference genome, released in March 2022, fills in the 8% of the human genome that was not resolved in GRCh38 and reconstructs large parts of the known genome. The more accurate and complete reference genome is expected to improve the quality of read mapping and variant calling. Even though whole-genome sequencing (WGS)-based approaches have become the gold standard in medical genetics, the extent of these benefits remains unclear. In this study, we evaluate mapping quality and variant-calling performance with T2T-CHM13 as a reference using a cross-sectional Swedish cohort (SweGen) comprising 1000 individuals with short-read Illumina WGS data available. Remapping and variant calling were performed using the nf-core/sarek pipeline. T2T-CHM13 improved a wide range of mapping and variant-calling metrics, including a higher fraction of properly paired reads, a lower mismatch rate, and more uniform coverage of coding regions. Moreover, the fraction of ambiguous alignments was higher, reflecting segmental duplications that were incorrectly collapsed in GRCh37 and GRCh38. In comparison to GRCh38, we identified 10 million additional variants in the cohort, including 5.5 million singletons, and observed an increased sensitivity for rare variants. SnpEff assigned impact ratings of moderate or high to 13% more variants in T2T-CHM13 than in GRCh38. In summary, we conclude that T2T-CHM13 improves alignment metrics, yielding higher mapping quality and better variant-calling performance and confidence, including for rare and deleterious variants. The T2T-CHM13 reference genome thus facilitates enhanced discovery of new disease-causing variation, benefiting, for example, rare-disease diagnostics.
{"title":"T2T-CHM13 improves read mapping and detection of clinically relevant genetic variation in the Swedish population","authors":"Daniel Schmitz, Adam Ameur, Åsa Johansson","doi":"10.1101/gr.279320.124","DOIUrl":"https://doi.org/10.1101/gr.279320.124","url":null,"PeriodicalId":12678,"journal":{"name":"Genome research","volume":"321 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145067755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tandemly repeated satellite DNAs (satDNAs) are among the most abundant and fastest-evolving eukaryotic sequences, but the way they shape genomes remains elusive. Here, we investigate the evolutionary dynamics of satDNAs in the extremely satDNA-rich genomes of two closely related Tribolium insects that produce sterile hybrids. In Tribolium freemani, we identify 135 satDNAs, accounting for 38.7% of the genome. Comparative analysis with the Tribolium castaneum satellitome reveals that the most drastic difference lies in their centromeric regions, which share an orthologous organization characterized by totally different major satDNAs but related minor satDNAs. The T. freemani male sex chromosome, which lacks the major satDNA but contains a minor-like satDNA, further highlights the question of which satDNA is centromere-competent. By analyzing the long-range organization of the centromeric regions, we discover that both the major and minor satDNA arrays exhibit a strong tendency toward macro-dyad symmetry, suggesting that the secondary structures in the centromeres may be more important than the primary sequence itself. We find evidence that the centromeric satDNAs of T. freemani occur in extrachromosomal circular DNAs, which may contribute to their expansion and homogenization between nonhomologous chromosomes. We also identify numerous low-copy-number satDNAs that are orthologous between the sibling species, some of which are associated with transposable elements, highlighting transposition as a mechanism of their spread. The dynamic evolution of satDNAs has clearly influenced the differentiation of Tribolium genomes, but the question remains whether the differences in their satDNA profiles are a cause or a consequence of speciation.
{"title":"Dynamic evolution of satellite DNAs drastically differentiates the genomes of Tribolium sibling species","authors":"Damira Veseljak, Evelin Despot-Slade, Marin Volarić, Lucija Horvat, Tanja Vojvoda Zeljko, Nevenka Meštrović, Brankica Mravinac","doi":"10.1101/gr.280516.125","DOIUrl":"https://doi.org/10.1101/gr.280516.125","url":null,"PeriodicalId":12678,"journal":{"name":"Genome research","volume":"17 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145067703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}