Genome Research: latest articles, page 8

Corrigendum: A sheep pangenome reveals the spectrum of structural variations and their effects on tail phenotypes

IF 7 Zone 2 (Biology) Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY

Genome research

Pub Date : 2025-10-01 DOI: 10.1101/gr.281340.125

Ran Li, Mian Gong, Xinmiao Zhang, Fei Wang, Zhenyu Liu, Lei Zhang, Qimeng Yang, Yuan Xu, Mengsi Xu, Huanhuan Zhang, Yunfeng Zhang, Xuelei Dai, Yuanpeng Gao, Zhuangbiao Zhang, Wenwen Fang, Yuta Yang, Weiwei Fu, Chunna Cao, Peng Yang, Zeinab Amiri Ghanatsaman, Niloufar Jafarpour Negari, Hojjat Asadollahpour Nanaei, Xiangpeng Yue, Yuxuan Song, Xianyong Lan, Weidong Deng, Xihong Wang, Chuanying Pan, Ruidong Xiang, Eveline M. Ibeagha-Awemu, Pat (J.S.) Heslop-Harrison, Benjamin D. Rosen, Johannes A. Lenstra, Shangquan Gan, Yu Jiang

Genome Research 33: 463–477 (2023)

Citations: 0

Strong bias in long-read sequencing prevents assembly of Drosophila melanogaster Y-linked genes

IF 7 Zone 2 (Biology) Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY

Genome research

Pub Date : 2025-10-01 DOI: 10.1101/gr.280604.125

Antonio Bernardo Carvalho, Bernard Y Kim, Fabiana Uno

Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) are generally considered free from sequence composition bias, a key factor, alongside read length, in their success in producing high-quality genome assemblies. Indeed, there have been very few reports of bias, the clearest being against GA-rich repeats in the human genome. However, our study reveals a systematic failure of both technologies to sequence and assemble specific exons of Drosophila melanogaster genes, indicating an overlooked limitation. Namely, multiple Y-linked exons are nearly or completely absent from raw reads produced by deep sequencing with state-of-the-art ONT (10.4 flow cells, 200× coverage) and PacBio (HiFi 50×). The same exons are accurately assembled using Illumina data at 67× coverage. We found that these missing exons are consistently located near simple satellite sequences, where sequencing fails at multiple levels: read initiation (very few reads start within satellite regions), read elongation (satellite-containing reads are shorter on average), and base-calling (quality scores drop as sequencing enters a satellite sequence). These findings challenge the assumption that long-read technologies are unbiased and reveal a critical barrier to assembling sequences near repetitive regions. As large-scale sequencing projects move towards telomere-to-telomere assemblies in a wide range of organisms, recognizing and addressing these biases will be important to achieving truly complete and accurate genomes. Additionally, the underrepresented Y-linked exons provide a valuable benchmark for refining those sequencing technologies while improving the assembly of the highly heterochromatic and often neglected Drosophila Y chromosome.
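
Two of the failure modes named in the abstract (few reads starting inside satellites, and shorter satellite-containing reads) can be illustrated with a toy calculation. The sketch below is not the authors' pipeline: the function names, the half-open interval convention, and the uniform-start expectation are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on a single contig (assumed convention)


def start_enrichment(read_starts: List[int], region: Interval, contig_len: int) -> float:
    """Observed / expected fraction of read starts inside `region`.

    The expectation assumes read starts are uniform along the contig,
    so a ratio well below 1 suggests reads rarely initiate in the region.
    """
    start, end = region
    if not read_starts or contig_len <= 0 or end <= start:
        raise ValueError("need at least one read, a positive contig length, and a non-empty region")
    observed = sum(start <= s < end for s in read_starts) / len(read_starts)
    expected = (end - start) / contig_len
    return observed / expected


def mean_length_by_overlap(
    reads: List[Interval], region: Interval
) -> Tuple[Optional[float], Optional[float]]:
    """Mean read length for reads overlapping `region` and for all other reads."""
    r_start, r_end = region
    inside = [e - s for s, e in reads if s < r_end and e > r_start]
    outside = [e - s for s, e in reads if not (s < r_end and e > r_start)]

    def mean(xs: List[int]) -> Optional[float]:
        return sum(xs) / len(xs) if xs else None

    return mean(inside), mean(outside)


# Hypothetical satellite block at positions 400-600 on a 1,000 bp contig.
satellite = (400, 600)
print(start_enrichment([5, 120, 480, 610, 905, 950, 30, 700], satellite, 1000))
print(mean_length_by_overlap(
    [(5, 305), (120, 520), (480, 560), (610, 1000), (30, 330), (700, 990)], satellite
))
```

On real data the intervals would come from read alignments, and the uniform expectation would need to account for mappability and coverage depth.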

Citations: 0

Highly accurate reference and method selection for universal cross-dataset cell type annotation with CAMUS

IF 7 Zone 2 (Biology) Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY

Genome research

Pub Date : 2025-10-01 DOI: 10.1101/gr.280821.125

Qunlun Shen, Shuqin Zhang, Shihua Zhang

Cell type annotation is a critical task in single-cell data analysis. Various reference-based methods provide rapid annotation for diverse single-cell data. However, the question of how to select the optimal references and methods is often overlooked. To this end, we present a cross-dataset cell-type annotation methodology with a universal reference data and method selection strategy (CAMUS) to achieve highly accurate and efficient annotations. We demonstrate the advantages of CAMUS by conducting comprehensive analyses on 672 pairs of cross-species scRNA-seq datasets. The annotation results with references selected by CAMUS achieved substantial accuracy gains (25.0-124.7%) over random selection strategies across five reference-based methods. CAMUS achieved high accuracy in choosing the best reference-method pair among 3360 pairs (49.1%). Moreover, CAMUS showed high accuracy in selecting the best methods on the 80 scST datasets (82.5%) and five scATAC-seq datasets (100.0%), illustrating its universal applicability. In addition, we utilized the CAMUS score with other metrics to predict the annotation accuracy, providing direct guidance on whether to accept current annotation results.

Citations: 0

Adaptation of centromere to breakage through local genomic and epigenomic remodeling in wheat

IF 7 Zone 2 (Biology) Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY

Genome research

Pub Date : 2025-09-30 DOI: 10.1101/gr.280913.125

Jingwei Zhou, Yuhong Huang, Huan Ma, Yiqian Chen, Chuanye Chen, Fangpu Han, Handong Su

Centromeres, characterized by their unique chromatin attributes, are indispensable for safeguarding genomic stability. Due to their intricate and fragile nature, centromeres are susceptible to chromosomal rearrangements. However, the mechanisms that preserve their functional integrity and support nuclear homeostasis following breakage remain enigmatic. In this study, we use wheat ditelosomic stocks, which arise from centromere breakage, to explore the genetic and epigenetic alterations in damaged centromeres. Our investigations unveil novel chromosome end structures marked by de novo addition of telomeres, as well as localized chromosomal shattering, including segment deletions and duplications near centromere breakpoints. We reveal that the damaged centromeres possess a remarkable capacity for self-regulation, employing structural modifications such as expansion, contraction, and neocentromere formation to maintain their functional integrity. Centromere breakage triggers nucleosome remodeling, is accompanied by local transcription changes and chromatin reorganization, and may subsequently contribute to the stabilization of broken chromosomes. Our findings highlight the resilience and adaptability of plant chromosomes in response to centromere breakage, and provide valuable insights into the stability of centromeres, thereby offering promising prospects to manipulate centromeres for targeted chromosomal innovation and crop genetic improvement.

Citations: 0

Long-read reconstruction of many diverse haplotypes with devider

IF 7 Zone 2 (Biology) Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY

Genome research

Pub Date : 2025-09-23 DOI: 10.1101/gr.280510.125

Jim Shaw, Christina Boucher, Yun William Yu, Noelle Noyes, Heng Li

Reconstructing exact haplotypes is important when sequencing a mixture of similar sequences. Long-read sequencing can connect distant alleles to disentangle similar haplotypes, but handling sequencing errors requires specialized techniques. We present devider, an algorithm for haplotyping small sequences - such as viruses or genes - from long-read sequencing. devider uses a positional de Bruijn graph with sequence-to-graph alignment on an alphabet of informative alleles to provide a fast assembly-inspired approach compatible with various long-read sequencing technologies. On a synthetic Nanopore dataset containing seven HIV strains, devider recovered 97% of the haplotype content and had the most accurate abundance estimates while taking < 4 minutes and 1 GB of memory for > 8000× coverage. Benchmarking on synthetic mixtures of antimicrobial resistance (AMR) genes showed that devider recovered 83% of haplotypes, 23 percentage points higher than the next best method. On real PacBio and Nanopore datasets, devider recapitulates previously known results in seconds, disentangling a bacterial community with > 10 strains and an HIV-1 co-infection dataset. We used devider to investigate the within-host diversity of a long-read bovine gut metagenome enriched for AMR genes, discovering 13 distinct haplotypes for a tet(Q) tetracycline resistance gene with > 18,000× coverage and 6 haplotypes for a CfxA2 beta-lactamase gene. We found clear recombination blocks for these AMR gene haplotypes, showcasing devider's ability to unveil evolutionary signals for heterogeneous mixtures.
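
The idea of working on "an alphabet of informative alleles" with position-aware k-mers can be shown on toy data. The sketch below is not devider's implementation: it assumes gap-free reads of equal length already aligned to the same coordinates, and the function names and the `min_minor` threshold are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def informative_sites(reads: List[str], min_minor: int = 2) -> List[int]:
    """Columns where the second most common base is seen at least `min_minor` times.

    Columns that vary only through isolated differences (likely sequencing
    errors) are dropped, so each read is reduced to its informative alleles.
    Assumes a non-empty list of equal-length reads; '-' marks a missing base.
    """
    sites = []
    for pos in range(len(reads[0])):
        counts = Counter(r[pos] for r in reads if r[pos] != "-")
        if len(counts) >= 2 and counts.most_common(2)[1][1] >= min_minor:
            sites.append(pos)
    return sites


def positional_kmers(reads: List[str], sites: List[int], k: int) -> Counter:
    """Count k-mers over the informative-allele alphabet, keyed by (site index, alleles).

    Keeping the site index in the key is what makes the k-mers positional:
    the same allele string at a different offset is a different graph node.
    """
    kmers: Counter = Counter()
    for read in reads:
        alleles = [read[p] for p in sites]
        for i in range(len(alleles) - k + 1):
            window = alleles[i:i + k]
            if "-" not in window:
                kmers[(i, "".join(window))] += 1
    return kmers


# Toy mixture: four reads from one haplotype (one carrying a lone error
# in the last column) and two reads from a second haplotype.
toy_reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ATGTAGGT", "ATGTAGGT", "ACGTACGA"]
sites = informative_sites(toy_reads)
print(sites)
print(positional_kmers(toy_reads, sites, k=2))
```

In this toy case the lone mismatch in the last column is filtered out, and the two remaining sites separate the reads into two allele strings whose counts reflect the mixture proportions.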

Citations: 0

Deep structural clustering reveals hidden systematic biases in RNA sequencing data

IF 7 Zone 2 (Biology) Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY

Genome research

Pub Date : 2025-09-19 DOI: 10.1101/gr.280713.125

Qiang Su, Yi Long, Deming Gou, Junmin Quan, Xiaoming Zhou, Qizhou Lian

RNA sequencing (RNA-seq) is a pivotal tool for transcriptomic analysis, providing comprehensive exploration of gene expression across diverse biological contexts. However, RNA-seq data is susceptible to various biases that can significantly compromise the accuracy and reliability of transcript quantification. This study investigates the influence of high-dimensional RNA structures on local sequencing efficiency using an innovative unsupervised Variational Autoencoder-Gaussian Mixture Model (VAE-GMM). The VAE-GMM effectively captures intricate high-dimensional k-mer structural similarities by learning compact latent representations, which reduces dimensionality while meticulously preserving essential structural features crucial for bias identification. This sophisticated modeling allows precise tracking of local RNA-read conversion dynamics and the identification of complex, often overlooked, bias sources. We rigorously validate the VAE-GMM model's performance and robustness against conventional machine learning techniques, including Gaussian Mixture Models (GMM-only), Principal Component Analysis-based GMMs, k-means clustering, and Hierarchical Clustering. These validations, using an extensive and diverse array of datasets including synthetic RNA constructs, various human cell lines, and authentic tissue samples, consistently demonstrate the model's superior versatility and accuracy across different biological systems. Furthermore, in silico simulations of the sequencing process closely align with actual sequencing data, strongly reinforcing the critical role of high-dimensional RNA structures in determining sequencing efficiency and their impact on data quality. Our findings offer valuable insights into the underlying mechanisms of RNA structure-mediated sequencing bias. This deeper understanding enables more accurate and reliable RNA-seq analyses and is expected to improve the interpretation of transcriptomic data in future genomic studies.

Citations: 0

Recalibrating differential gene expression by genetic dosage variance prioritizes functionally relevant genes

IF 7 Zone 2 (Biology) Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY

Genome research

Pub Date : 2025-09-17 DOI: 10.1101/gr.280360.124

Philipp Rentzsch, Aaron Kollotzek, Kaushik Ram Ganapathy, Pejman Mohammadi, Tuuli Lappalainen

Differential expression (DE) analysis is a widely used method for identifying genes that are functionally relevant for an observed phenotype or biological response. However, typical DE analysis includes selection of genes based on a threshold of fold change in expression under the implicit assumption that all genes are equally sensitive to dosage changes of their transcripts. This tends to favor highly variable genes over more constrained genes where even small changes in expression may be biologically relevant. To address this limitation, we have developed a method to recalibrate each gene's DE fold change based on genetic expression variance observed in the human population. The newly established metric ranks statistically differentially expressed genes, not by nominal change of expression, but by relative change in comparison to natural dosage variation for each gene. We apply our method to RNA sequencing data sets from in vitro stimulus response and neuropsychiatric disease experiments. Compared to the standard approach, our method adjusts the bias in discovery toward highly variable genes and enriches for pathways and biological processes related to metabolic and regulatory activity, indicating a prioritization of functionally relevant driver genes. Tissue-specific recalibration increases detection of known disease-relevant processes. Altogether, our method provides a novel view on DE and contributes toward bridging the existing gap between statistical and biological significance. We believe that this approach will simplify the identification of disease-causing molecular processes and enhance the discovery of therapeutic targets.
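
The ranking idea (relative change against each gene's natural dosage variation, rather than nominal fold change) can be sketched as follows. The abstract does not give the formula, so dividing the absolute log2 fold change by a per-gene standard deviation is an assumption made for illustration, as are the function name and the toy gene labels.

```python
from typing import Dict, List, Tuple


def recalibrated_ranking(
    log2_fc: Dict[str, float], dosage_sd: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Rank genes by |log2 fold change| scaled by natural expression variation.

    `dosage_sd` holds an assumed per-gene standard deviation of genetically
    driven expression (log2 scale) in the population. Genes without a
    positive estimate are skipped rather than guessed.
    """
    scores = {}
    for gene, lfc in log2_fc.items():
        sd = dosage_sd.get(gene)
        if sd is None or sd <= 0:
            continue
        scores[gene] = abs(lfc) / sd
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical values: GENE_B changes least in nominal terms but is tightly
# constrained, so it moves to the top once scaled by its natural variation.
fold_changes = {"GENE_A": 2.0, "GENE_B": 0.5, "GENE_C": 1.0}
natural_sd = {"GENE_A": 1.0, "GENE_B": 0.1, "GENE_C": 0.8}
for gene, score in recalibrated_ranking(fold_changes, natural_sd):
    print(gene, round(score, 2))
```

The design choice worth noting is that the scaling is applied only to rank genes already called differentially expressed, which matches the abstract's description of ranking statistically significant genes rather than replacing the significance test.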

Citations: 0

Analyzing the large and complex SFARI autism cohort data using the Genotypes and Phenotypes in Families (GPF) platform

IF 7 Zone 2 (Biology) Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY

Genome research

Pub Date : 2025-09-16 DOI: 10.1101/gr.280356.124

Liubomir Chorbadjiev, Murat Cokol, Zohar Weinstein, Kevin Shi, Christopher Fleisch, Nikolay Dimitrov, Svetlin Mladenov, Ivo Todorov, Iordan Ivanov, Simon Xu, Steven Ford, Yoon-ha Lee, Boris Yamrom, Steven Marks, Adriana Munoz, Alex Lash, Natalia Volfovsky, Ivan Iossifov

The exploration of genotypic variants impacting phenotypes is a cornerstone in genetics research. The emergence of vast collections containing deeply genotyped and phenotyped families has made it possible to pursue the search for variants associated with complex diseases. However, managing these large-scale data sets requires specialized computational tools to organize and analyze the extensive data. Genotypes and Phenotypes in Families (GPF) is an open-source platform that manages genotypes and phenotypes derived from collections of families. GPF allows interactive exploration of genetic variants, enrichment analysis for de novo mutations, phenotype/genotype association tools, and secure data sharing. GPF is used to disseminate two family collection data sets, SSC and SPARK, for the study of autism, built by the Simons Foundation. The GPF instance at the Simons Foundation (GPF-SFARI) provides protected access to comprehensive genotypic and phenotypic data for SSC and SPARK. GPF-SFARI also provides public access to an extensive collection of de novo mutations from individuals with autism and related disorders and to gene-level statistics of the protected data sets characterizing the genes’ roles in autism. However, GPF is versatile and can manage genotypic data from other small or large family collections. Here, we highlight the primary features of GPF within the context of GPF-SFARI.


Citations: 0

Long-read sequencing reveals HBV integration patterns and oncogenic impact on early-onset hepatocellular carcinoma

IF 7 · Zone 2 (Biology) · Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY

Genome research

Pub Date : 2025-09-16 DOI: 10.1101/gr.279889.124

Yao Wang, Dong Yu, Yue Mei, Zhida Fu, Jian Lin, Di Wu, Yuan Yang, Hongli Yan

Hepatitis B virus (HBV) integration is a key driver of hepatocellular carcinoma (HCC) occurrence and progression; however, its oncogenic mechanisms remain incompletely understood because of limitations in detection methods and sample availability. In this study, we employed Oxford Nanopore Technologies (ONT) whole-genome sequencing and full-length transcriptome sequencing to characterize HBV integration events at the genomic and transcriptomic levels, along with their regulatory effects on structural variations (SVs) and gene expression. Functional validation was performed using dual-luciferase assays and cell-based experiments. Our findings revealed that integrated HBV sequences form long concatemers, mediating inter- and intrachromosomal recombination in the human genome. Notably, integrated HBV enhancer I (HBV-Enh I) was detected in 6 of 7 tumor tissues and was associated with aberrant gene expression. HBV integration induced oncogenic SVs, such as focal MYC amplification and NAV2 deletion, and directly modulated gene expression. Additionally, ectopic overexpression of MYOCD, driven by HBV-Enh I integration, promoted HCC cell migration and invasion. In summary, HBV integration acts as a major driver of large-scale genomic SVs and transcriptomic dysregulation, through either direct alterations in genome dosage or cis-regulatory mechanisms. HBV-Enh I is frequently integrated in HCC and might play a pivotal role in abnormal gene expression, highlighting its potential as a therapeutic target.


Citations: 0

T2T-CHM13 improves read mapping and detection of clinically relevant genetic variation in the Swedish population

IF 7 · Zone 2 (Biology) · Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY

Genome research

Pub Date : 2025-09-16 DOI: 10.1101/gr.279320.124

Daniel Schmitz, Adam Ameur, Åsa Johansson

The T2T-CHM13 reference genome, released in March 2022, fills in the 8% of the human genome that was not resolved in GRCh38 and reconstructs large parts of the known genome. The more accurate and complete reference genome is expected to improve the quality of read mapping and variant calling. Even though whole genome sequencing (WGS)-based approaches have become the gold standard in medical genetics, the extent of these benefits remains unclear. In this study, we aim to evaluate mapping quality and variant-calling performance with T2T-CHM13 as a reference using a cross-sectional Swedish cohort (SweGen) comprising 1000 individuals with short-read Illumina WGS data available. Remapping and variant calling were performed using the nf-core/sarek pipeline. T2T-CHM13 improved a wide range of mapping- and variant-calling-related metrics, including a higher fraction of properly paired reads, lower mismatch rate, and more uniform coverage of coding regions. Moreover, the fraction of ambiguous alignments was higher, reflecting segmental duplications that were incorrectly collapsed in GRCh37 and GRCh38. In comparison to GRCh38, we identified 10 million additional variants in the cohort, including 5.5 million singletons, and observed an increased sensitivity for rare variants. SnpEff assigned impact ratings of moderate or high to 13% more variants in T2T-CHM13 than in GRCh38. In summary, we conclude that T2T-CHM13 improves alignment metrics, with higher mapping quality and better variant-calling performance and confidence, including for rare and deleterious variants. The T2T-CHM13 genome reference thus facilitates enhanced discovery of new disease-causing variation, benefiting, for example, rare-disease diagnostics.


Citations: 0