Genome Research: latest articles, page 10

Estimating the size of long tandem repeat expansions from short reads with ScatTR

IF 7 | Zone 2, Biology | Q1 Biochemistry & Molecular Biology

Genome research

Pub Date : 2025-08-21 DOI: 10.1101/gr.280563.125

Rashid Al-Abri, Gamze Gursoy

Tandem repeats (TRs) are sequences of DNA where two or more base pairs are repeated back-to-back at specific locations in the genome. TR expansions, where the number of repeat units exceeds the normal range, have been implicated in over 50 conditions. However, accurately measuring the copy number of TRs is challenging, especially when their expansions are larger than the fragment sizes used in standard short-read genome sequencing. Here, we introduce ScatTR, a novel computational method that leverages a maximum likelihood framework to estimate the copy number of large TR expansions from short-read sequencing data. ScatTR calculates the likelihood of different alignments between sequencing reads and reference sequences that represent various TR lengths and employs a Monte Carlo technique to find the best match. In simulated data, ScatTR outperforms state-of-the-art methods, particularly for TRs with longer motifs and those with lengths that greatly exceed typical sequencing fragment sizes. When applied to data from the 1000 Genomes Project, ScatTR detects potential large TR expansions that other methods missed, highlighting its ability to better characterize genome-wide TR variation.
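To make the maximum-likelihood idea concrete, here is a minimal Python sketch. It is not ScatTR's model: it assumes, purely for illustration, that the number of reads starting inside the repeat tract is Poisson-distributed with a mean proportional to tract length, and it replaces the Monte Carlo search mentioned in the abstract with an exhaustive grid search. All function names, parameters, and numbers are hypothetical.

```python
import math


def poisson_log_likelihood(observed_reads, expected_reads):
    # log P(k | lam) = k * log(lam) - lam - log(k!)
    return (observed_reads * math.log(expected_reads)
            - expected_reads
            - math.lgamma(observed_reads + 1))


def expected_repeat_reads(copy_number, motif_len, read_len, depth):
    # Assumed toy model: reads start uniformly along the genome, so the
    # expected number starting inside the tract is depth * tract_len / read_len
    # (edge effects at the tract boundaries are ignored).
    tract_len = copy_number * motif_len
    return depth * tract_len / read_len


def estimate_copy_number(observed_reads, motif_len, read_len, depth,
                         max_copies=5000):
    # Grid search over candidate copy numbers; keep the most likely one.
    return max(
        range(1, max_copies + 1),
        key=lambda c: poisson_log_likelihood(
            observed_reads,
            expected_repeat_reads(c, motif_len, read_len, depth),
        ),
    )


if __name__ == "__main__":
    # 600 in-repeat reads, 3 bp motif, 150 bp reads, 30x coverage.
    print(estimate_copy_number(600, motif_len=3, read_len=150, depth=30))
```

The grid search is only practical because this toy likelihood depends on a single integer; a model that scores read alignments against candidate references, as the abstract describes, has a far larger search space, which is presumably why a sampling-based search is used there.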

Citations: 0

Deciphering context-specific gene programs from single-cell and spatial transcriptomics data with DeCEP

IF 7 | Zone 2, Biology | Q1 Biochemistry & Molecular Biology

Genome research

Pub Date : 2025-08-21 DOI: 10.1101/gr.279689.124

Lin Li, Xianbin Su, Ze-Guang Han

Functional gene programs play a wide range of roles in health and disease by orchestrating transcriptional coregulation to govern cell identity. Understanding these intricate gene programs is essential for unraveling the complexities of biological systems; however, deciphering them remains a significant challenge. Recent advancements in single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics (ST) technologies have enabled the comprehensive characterization of gene programs at both single-cell and spatial resolutions. Here, we present DeCEP, a computational framework designed to characterize context-specific gene programs using scRNA-seq and ST data. DeCEP leverages functional gene lists and directed graphs to construct functional networks underlying distinct cellular or spatial contexts. It then identifies context-dependent hub genes associated with specific gene programs based on network topology and assigns gene program activity to individual cells or spatial locations. Through evaluation on both simulated and real biological datasets, DeCEP demonstrates complementary strengths over existing methods by enabling more fine-grained characterization of gene programs within specific contexts, particularly those characterized by pronounced transcriptional heterogeneity. Furthermore, we showcase the ability of DeCEP to elucidate biological insights through case studies on normal liver tissue, Alzheimer's disease, and cancer.

Citations: 0

Accurate detection of tandem repeats from error-prone sequences with EquiRep

IF 7 | Zone 2, Biology | Q1 Biochemistry & Molecular Biology

Genome research

Pub Date : 2025-08-21 DOI: 10.1101/gr.280750.125

Zhezheng Song, Tasfia Zahin, Xiang Li, Mingfu Shao

A tandem repeat is a sequence of nucleotides that appears as multiple near-identical copies arranged consecutively. Tandem repeats are widespread across natural genomes, play critical roles in genetic diversity and gene regulation, and are associated with various neurological and developmental disorders. They can also arise in sequencing reads generated by certain technologies, such as those used for sequencing circular molecules. A key challenge in analyzing tandem repeats is reconstructing the sequence of the underlying repeat unit. While several methods exist, they often exhibit low accuracy when the repeat unit length increases or the number of copies is low. Furthermore, methods capable of handling highly mutated sequences remain scarce, highlighting a significant opportunity for improvement. We introduce EquiRep, a tool for accurate detection of tandem repeats from erroneous sequences. EquiRep estimates the likelihood of positions originating from the same location in the unit through self-alignment, followed by a novel refinement approach. The resulting equivalence classes and consecutive position information are then used to build a weighted graph. A cycle in this graph with maximum bottleneck weight covering most nucleotide positions is identified to reconstruct the repeat unit. We test EquiRep on two applications: identifying repeat units from satellite DNAs and reconstructing circular RNAs from rolling-circular long-read sequencing data, using both simulated and raw sequencing datasets. Our results show that EquiRep consistently outperforms or matches state-of-the-art methods, demonstrating robustness to sequencing errors and superior performance on long repeat units and low-frequency repeats. These capabilities underscore EquiRep’s broad utility in tandem repeat analysis.
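The notion of positions "originating from the same location in the unit" can be illustrated with a much simpler stand-in than EquiRep's self-alignment and graph construction. The Python sketch below assumes substitution errors only (no insertions or deletions, which real error-prone reads do contain): it compares the sequence with shifted copies of itself to guess the unit length, then takes a majority vote over positions that share the same offset. Function names and the 0.8 threshold are hypothetical choices for this example.

```python
from collections import Counter


def shift_match_fraction(seq, shift):
    # Fraction of positions i where seq[i] equals seq[i + shift].
    pairs = len(seq) - shift
    matches = sum(1 for i in range(pairs) if seq[i] == seq[i + shift])
    return matches / pairs


def estimate_unit(seq, min_fraction=0.8):
    # Try unit lengths from shortest to longest; accept the first shift at
    # which the sequence mostly agrees with its own shifted copy.
    for shift in range(1, len(seq) // 2 + 1):
        if shift_match_fraction(seq, shift) >= min_fraction:
            # Positions with the same offset modulo `shift` form one class;
            # the consensus base of each class gives one base of the unit.
            return "".join(
                Counter(seq[offset::shift]).most_common(1)[0][0]
                for offset in range(shift)
            )
    return None  # no periodic structure found at this threshold


if __name__ == "__main__":
    bases = list("ACGTT" * 6)
    bases[12] = "C"  # introduce one substitution error
    print(estimate_unit("".join(bases)))
```

Because insertions and deletions shift every downstream position, this fixed-offset voting breaks down on realistic noisy reads, which is the situation the alignment-based equivalence classes in the abstract are meant to handle.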

Citations: 0

Pangenome-based genome inference using integer programming

IF 7 | Zone 2, Biology | Q1 Biochemistry & Molecular Biology

Genome research

Pub Date : 2025-08-21 DOI: 10.1101/gr.280567.125

Ghanshyam Chandra, Md Helal Hossen, Stephan Scholz, Alexander T Dilthey, Daniel Gibney, Chirag Jain

Affordable genotyping methods are essential in genomics. Commonly used genotyping methods primarily support single nucleotide variants and short indels but neglect structural variants. Additionally, accuracy of read alignments to a reference genome is unreliable in highly polymorphic and repetitive regions, further impacting genotyping performance. Recent works highlight the advantage of haplotype-resolved pangenome graphs in addressing these challenges. Building on these developments, we propose a rigorous alignment-free genotyping method. Our optimization framework identifies a path through the pangenome graph that maximizes the matches between the path and substrings of sequencing reads (e.g., k-mers) while minimizing recombination events (haplotype switches) along the path. We prove that this problem is NP-Hard and develop efficient integer-programming solutions. We benchmarked the algorithm using downsampled short-read datasets from homozygous human cell lines with coverage ranging from 0.1× to 10×. Our algorithm accurately estimates complete major histocompatibility complex (MHC) haplotype sequences with small edit distances from the ground-truth sequences, providing a significant advantage over existing methods on low-coverage inputs. While this algorithm is designed for haploid genomes, we discuss directions for extending it to diploid genotyping.
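The trade-off between matching reads and limiting haplotype switches can be sketched on a deliberately simplified version of the problem. The Python example below assumes the graph has been cut into consecutive blocks and that each haplotype already has an additive match score per block; under that assumption a dynamic program suffices. This is not the paper's formulation, which is NP-hard and solved with integer programming; the function name, the score matrix, and the penalty values are hypothetical.

```python
def best_mosaic_path(block_scores, switch_penalty):
    """Pick one haplotype per block to maximize
    (sum of block scores) - switch_penalty * (number of haplotype switches).

    block_scores[h][b] is the assumed match score of haplotype h in block b.
    Returns (total_score, list of chosen haplotype indices).
    """
    n_haps = len(block_scores)
    n_blocks = len(block_scores[0])
    best = [block_scores[h][0] for h in range(n_haps)]
    back = []  # back[b-1][h] = haplotype used in block b-1 when block b uses h

    for b in range(1, n_blocks):
        top = max(range(n_haps), key=lambda h: best[h])
        new_best, pointers = [], []
        for h in range(n_haps):
            stay = best[h]                       # continue on the same haplotype
            switch = best[top] - switch_penalty  # recombine from the best one
            if stay >= switch:
                value, prev = stay, h
            else:
                value, prev = switch, top
            new_best.append(value + block_scores[h][b])
            pointers.append(prev)
        best = new_best
        back.append(pointers)

    h = max(range(n_haps), key=lambda i: best[i])
    total = best[h]
    path = [h]
    for pointers in reversed(back):
        h = pointers[h]
        path.append(h)
    path.reverse()
    return total, path


if __name__ == "__main__":
    scores = [[5, 5, 0, 0],
              [0, 0, 5, 5]]
    print(best_mosaic_path(scores, switch_penalty=3))
```

With a small penalty the best path switches from the first haplotype to the second halfway through; raising the penalty above the gain from switching makes a single-haplotype path optimal instead.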

Citations: 0

High-quality assembly of the Chinese white truffle genome and recalibrated divergence time estimate provide insight into the evolutionary dynamics of Tuberaceae

IF 7 | Zone 2, Biology | Q1 Biochemistry & Molecular Biology

Genome research

Pub Date : 2025-08-21 DOI: 10.1101/gr.280368.124

Jacopo Martelossi, Jacopo Vujovic, Yue Huang, Alessia Tatti, Kaiwei Xu, Federico Puliga, Yuanxue Chen, Omar Rota Stabelli, Fabrizio Ghiselli, Xiaoping Zhang, Alessandra Zambonelli

The genus Tuber (family: Tuberaceae) includes the most economically valuable ectomycorrhizal (ECM), truffle-forming fungi. Previous genomic analyses revealed that massive transposable element (TE) proliferation represents a convergent genomic feature of ECM fungi, including Tuberaceae. Repetitive sequences constitute a principal driver of genome evolution shaping its architecture and regulatory networks. In this context, Tuberaceae can become an important model system to study their genomic impact; however, the family lacks high-quality assemblies. Here, we investigate the interplay between TEs and Tuberaceae genome evolution by producing a highly contiguous assembly for the endangered Chinese white truffle Tuber panzhihuanense, along with a recalibrated timeline for Tuberaceae diversification and comprehensive comparative genomic analyses. We find that, concurrently with a Paleogene diversification of the family, pre-existing Chromoviridae-related Gypsy clades independently expanded in different truffle lineages, leading to increased genome size and high gene family turnover rates, but without resulting in highly rearranged genomes. Additionally, we uncover a significant enrichment of ECM-induced gene families stemming from ancestral duplication events. Finally, we explore the repetitive structure of nuclear ribosomal DNA (rDNA) loci for the first time in the clade. Most of the 45S rDNA paralogues are undergoing concerted evolution, though an isolated divergent locus raises concerns about potential issues for metabarcoding and biodiversity assessments. Our study establishes a fundamental genomic resource for future research on truffle genomics and showcases a clear example of how establishment and self-perpetuating expansion of heterochromatin can drive massive genome size variation due to activity of selfish genetic elements.

Citations: 0

Ultra-long sequencing for contiguous haplotype resolution of the human immunoglobulin heavy chain locus

IF 7 | Zone 2, Biology | Q1 Biochemistry & Molecular Biology

Genome research

Pub Date : 2025-08-21 DOI: 10.1101/gr.280400.125

Mari B Gornitzka, Egil Røsjø, Uddalok Jana, Easton E Ford, Alan Tourancheau, William Lees, Zachary Vanwinkle, Melissa L Smith, Corey T Watson, Andreas Lossius

Genetic diversity within the human immunoglobulin heavy chain (IGH) locus influences the expressed antibody repertoire and susceptibility to infectious and autoimmune diseases. However, repetitive sequences and complex structural variation pose significant challenges for large-scale characterization. Here, we introduce a method that combines Oxford Nanopore Technologies ultra-long sequencing and adaptive sampling with a bioinformatic pipeline to produce haplotype-resolved, annotated IGH assemblies. Notably, our strategy overcomes prior limitations in phasing resolution, enabling single-contig haplotype assemblies that span the entire IGH locus. We apply this method to four individuals and validate the accuracy of the IGH assemblies using Pacific Biosciences HiFi reads, demonstrating near-complete sequence congruence, with only some residual indel errors. Moreover, when applied to the reference material HG002, our pipeline reveals no base differences and a limited number of indels compared with the Telomere-to-Telomere genome benchmark across the IGH region. Importantly, in the four individuals, our approach uncovers 28 novel alleles and previously uncharacterized large structural variants, including a 120 kb duplication spanning IGHE to IGHA1 within the IGH constant region (IGHC) and, within the IGHV region, an expanded seven-copy IGHV3-23 gene haplotype. These findings underscore the power of our method to resolve the full complexity of the IGH locus and uncover previously unrecognized variants that may affect immune function and disease susceptibility. Thus, our method provides a strong basis for future immunological research and translational applications.

Citations: 0

Unveiling the functional fate of duplicated genes through expression profiling and structural analysis

IF 7 | Zone 2, Biology | Q1 Biochemistry & Molecular Biology

Genome research

Pub Date : 2025-08-14 DOI: 10.1101/gr.280166.124

Alex Warwick Vesztrocy, Natasha Glover, Paul D. Thomas, Christophe Dessimoz, Irene Julca

Gene duplication is a major evolutionary source of functional innovation. Following duplication events, gene copies (paralogues) may undergo various fates, including retention with functional modifications (such as subfunctionalization or neofunctionalization) or loss. When paralogues are retained, this results in complex orthology relationships, including one-to-many or many-to-many. In such cases, determining which one-to-one pair is more likely to have conserved functions can be challenging. It has been proposed that, following gene duplication, the copy that diverges more slowly in sequence is more likely to maintain the ancestral function, referred to here as "the least diverged orthologue (LDO) conjecture". This study explores this conjecture, using a novel method to identify asymmetric evolution of paralogues and applying it to all gene families across the Tree of Life in the PANTHER database. Structural data for over 1 million proteins and expression data for 16 animals and 20 plants were then used to investigate functional divergence following duplication. This analysis, the most comprehensive to date, revealed that whilst the majority of paralogues display similar rates of sequence evolution, significant differences in branch lengths following gene duplication can be correlated with functional divergence. Overall, the results support the LDO conjecture, suggesting that the LDO tends to retain the ancestral function, whilst the most diverged orthologue (MDO) may acquire a new, potentially specialized, role.
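As a small illustration of the LDO/MDO terminology, the Python sketch below labels two paralogues by their branch lengths after the duplication node. The branch-length ratio used to flag asymmetry is a hypothetical stand-in; the abstract does not specify the study's statistical test, and the function name and example values are likewise assumptions.

```python
def label_paralogues(branch_lengths, min_ratio=2.0):
    """Label the two copies arising from one duplication.

    branch_lengths maps copy name -> branch length since the duplication.
    The shorter branch is the least diverged orthologue (LDO), the longer
    one the most diverged orthologue (MDO). `min_ratio` is an assumed,
    illustrative cut-off for calling the pair asymmetric.
    """
    if len(branch_lengths) != 2:
        raise ValueError("expected exactly two paralogues")
    (ldo, short), (mdo, long_) = sorted(
        branch_lengths.items(), key=lambda item: item[1]
    )
    asymmetric = long_ > min_ratio * short
    return {"LDO": ldo, "MDO": mdo, "asymmetric": asymmetric}


if __name__ == "__main__":
    print(label_paralogues({"copy1": 0.40, "copy2": 0.05}))
```

A fixed ratio ignores uncertainty in branch-length estimates, so a real analysis would need a test that accounts for it.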

Citations: 0

Multicondition and multimodal temporal profile inference during mouse embryonic development

IF 7 | Zone 2 (Biology) | Q1 Biochemistry & Molecular Biology

Genome research

Pub Date : 2025-08-14 DOI: 10.1101/gr.279997.124

Ran Zhang, Chengxiang Qiu, Galina Filippova, Gang Li, Jay Shendure, Jean-Philippe Vert, Xinxian Deng, William Stafford Noble, Christine Disteche

The emergence of single-cell time-series datasets enables modeling of changes in various types of cellular profiles over time. However, due to the disruptive nature of single-cell measurements, it is impossible to capture the full temporal trajectory of a particular cell. Furthermore, single-cell profiles can be collected at mismatched time points across different conditions (e.g., sex, batch, disease) and data modalities (e.g., scRNA-seq, scATAC-seq), which makes modeling challenging. Here we propose a joint modeling framework, Sunbear, for integrating multicondition and multimodal single-cell profiles across time. Sunbear can be used to impute single-cell temporal profile changes, align multidataset and multimodal profiles across time, and extrapolate single-cell profiles in a missing modality. We applied Sunbear to reveal sex-biased transcription during mouse embryonic development and predict dynamic relationships between epigenetic priming and transcription for cells in which multimodal profiles are unavailable. Sunbear thus enables the projection of single-cell time-series snapshots to multimodal and multicondition views of cellular trajectories.



Citations: 0

Efficient integration of spatial omics data for joint domain detection, matching, and alignment with stMSA

IF 7 | Zone 2 (Biology) | Q1 Biochemistry & Molecular Biology

Genome research

Pub Date : 2025-08-14 DOI: 10.1101/gr.280584.125

Han Shu, Jing Chen, Chang Xu, Jialu Hu, Yongtian Wang, Jiajie Peng, Qinghua Jiang, Xuequn Shang, Tao Wang

Spatial omics (SO) is a powerful methodology that enables the study of genes, proteins, and other molecular features within the spatial context of tissue architecture. With the growing availability of SO datasets, researchers are eager to extract biological insights from larger datasets for a more comprehensive understanding. However, existing approaches focus on batch effect correction, often neglecting complex biological patterns in tissue slices, complicating feature integration and posing challenges when combining transcriptomics with other omics layers. Here, we introduce stMSA (SpaTial Multi-Slice/omics Analysis), a deep graph contrastive learning model that incorporates graph auto-encoder techniques. stMSA is specifically designed to produce batch-corrected representations while retaining the distinct spatial patterns within each slice, considering both intra- and interbatch relationships during integration. Extensive evaluations show that stMSA outperforms state-of-the-art methods in distinguishing tissue structures across diverse slices, even when faced with varying experimental protocols and sequencing technologies. Furthermore, stMSA effectively deciphers complex developmental trajectories by integrating spatial proteomics and transcriptomics data and excels in cross-slice matching and alignment for 3D tissue reconstruction.



Citations: 0

FocalSV enables target region-based structural variant assembly and refinement using single-molecule long-read sequencing data

IF 7 | Zone 2 (Biology) | Q1 Biochemistry & Molecular Biology

Genome research

Pub Date : 2025-08-13 DOI: 10.1101/gr.280282.124

Can Luo, Zimeng Jamie Zhou, Yichen Henry Liu, Xin Maizie Zhou

Structural variants (SVs) play a critical role in shaping the diversity of the human genome and their detection holds significant potential for advancing precision medicine. Despite notable progress in single-molecule long-read sequencing technologies, accurately identifying SV breakpoints and resolving their sequence remains a major challenge. Current alignment-based tools often struggle with precise breakpoint detection and sequence characterization, while whole genome assembly-based methods are computationally demanding and less practical for targeted analyses. Neither approach is ideally suited for scenarios where regions of interest are predefined and require precise SV characterization. To address this gap, we introduce FocalSV, a targeted SV detection framework that integrates both assembly- and alignment-based signals. By combining the precision of local assemblies with the efficiency of region-specific analysis, FocalSV enables more accurate SV detection. FocalSV supports user-defined target regions and can automatically identify and expand regions with potential structural variants to enable more comprehensive detection. FocalSV was evaluated on ten germline datasets and two paired normal-tumor cancer datasets, demonstrating superior performance in both precision and efficiency.
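
The abstract mentions user-defined target regions that can be expanded for more comprehensive detection. As a rough illustration of that general idea only, the sketch below pads each target region by a fixed flank and merges regions that then overlap. This is not FocalSV's actual algorithm or interface: the function name `expand_and_merge`, the `Region` tuple layout, and the `flank` default are assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical region layout: (chromosome, start, end), 0-based half-open.
Region = Tuple[str, int, int]


def expand_and_merge(regions: List[Region], flank: int = 1000) -> List[Region]:
    """Pad each region by `flank` bases on both sides, then merge overlaps."""
    # Pad, clamp the start at 0, and sort by chromosome name and start.
    padded = sorted(
        (chrom, max(0, start - flank), end + flank)
        for chrom, start, end in regions
    )

    merged: List[Region] = []
    for chrom, start, end in padded:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps or touches the previous region on the same chromosome.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


targets = [("chr1", 5000, 6000), ("chr1", 7500, 8000), ("chr2", 100, 200)]
print(expand_and_merge(targets, flank=1000))
```

Merging after padding keeps nearby targets in a single window, so that in a region-based workflow a variant spanning two adjacent targets would be handled once rather than split across windows.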



Citations: 0