
NAR Genomics and Bioinformatics: Latest Articles

Proteins need extra attention: improving the predictive power of protein language models on mutational datasets with hint tokens.
IF 2.8 | Q1 GENETICS & HEREDITY | Pub Date: 2025-09-26 | eCollection Date: 2025-09-01 | DOI: 10.1093/nargab/lqaf128
Xinning Li, Ryann M Perez, Sam Giannakoulias, E James Petersson

In this computational study, we address the challenge of predicting protein functions following mutations by fine-tuning protein language models (PLMs) using a novel tokenization strategy, hint token learning (HTL). To evaluate the effectiveness of HTL, we benchmarked this approach across four pretrained models with varying architectures and sizes on four diverse protein mutational datasets. Our results showed significant improvements in weighted F1 scores in most cases when HTL was applied. To understand how HTL enhances protein mutational predictions, we trained sparse autoencoders on embeddings derived from the fine-tuned PLMs. Analysis of the latent spaces revealed that the number of activated residues within functional protein domains increased when PLMs were trained with HTL. These findings indicate that PLMs fine-tuned with HTL may capture more biologically relevant representations of proteins. Our study highlights the potential of HTL to advance protein function prediction and provides insights into how HTL enables PLMs to capture mutational impacts at the functional level. All data and code are available at: https://github.com/ejp-lab/EJPLab_Computational_Projects/tree/master/HintTokenLearning.
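
The weighted F1 score used as the benchmark metric in the abstract above averages per-class F1 values, weighting each class by its number of true examples. A minimal sketch in plain Python follows; the function name and the toy labels are illustrative assumptions and are not taken from the study's code.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    support = Counter(y_true)  # number of true examples per class
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n_true * f1  # weight each class by its support
    return total / len(y_true)


# Toy labels (hypothetical): three "a" examples and one "b" example.
score = weighted_f1(["a", "a", "a", "b"], ["a", "a", "b", "b"])
```

Because the weights are class supports, a frequent class dominates the score, which is one reason this metric is often chosen for imbalanced mutational datasets.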

NAR Genomics and Bioinformatics 7(3): lqaf128. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12464817/pdf/
Citations: 0
nf-core/detaxizer: a benchmarking study for decontamination from human sequences.
IF 2.8 | Q1 GENETICS & HEREDITY | Pub Date: 2025-09-23 | eCollection Date: 2025-09-01 | DOI: 10.1093/nargab/lqaf125
Jannik Seidel, Camill Kaipf, Daniel Straub, Sven Nahnsen

Privacy is paramount in health data, particularly in human genetics, where information extends beyond individuals to their relatives. Metagenomic datasets contain substantial human genetic material, necessitating careful handling to mitigate data leakage risks when sharing or publishing. The same applies to genetic datasets from the environment or datasets from contaminated laboratory samples, although to a lesser extent. Completely removing human sequence data while retaining unbiased nonhuman reads is not achievable currently, but several tools exist. To address these topics, we developed nf-core/detaxizer, a nextflow-based pipeline that employs Kraken2 and bbmap/bbduk for taxonomic classification, identifying and optionally filtering Homo sapiens reads. Due to its generalized design, other taxa can also be classified and filtered. We benchmark its filtering efficacy for human reads against Hostile and CLEAN, demonstrating its utility for secure data preprocessing. The comparison showed that the choice of tool and database can result in differences of up to an order of magnitude in both the amount of human data not removed and the amount of microbial data mistakenly removed. As part of the nf-core initiative, nf-core/detaxizer adheres to best practices, leveraging containerized dependencies for streamlined installation. The source code is openly available under the MIT license: https://github.com/nf-core/detaxizer.
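
The two quantities compared in the benchmark above, human data left behind and microbial data mistakenly removed, can be computed from read-level ground truth. The sketch below is illustrative only: the function name, the set-based inputs, and the read IDs are assumptions, and it is not part of nf-core/detaxizer.

```python
def decontamination_errors(human_ids, microbial_ids, removed_ids):
    """Return (fraction of human reads retained, fraction of microbial reads removed)."""
    human, microbial, removed = set(human_ids), set(microbial_ids), set(removed_ids)
    if not human or not microbial:
        raise ValueError("both read sets must be non-empty")
    human_retained = len(human - removed) / len(human)
    microbial_removed = len(microbial & removed) / len(microbial)
    return human_retained, microbial_removed


# Hypothetical read IDs: one human read slips through, one microbial read is lost.
errors = decontamination_errors(
    human_ids={"h1", "h2", "h3", "h4"},
    microbial_ids={"m1", "m2", "m3", "m4", "m5"},
    removed_ids={"h1", "h2", "h3", "m1"},
)
```

Reporting both fractions side by side makes the trade-off visible: a more aggressive filter lowers the first number at the cost of raising the second.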

NAR Genomics and Bioinformatics 7(3): lqaf125. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12455401/pdf/
Citations: 0
IMGT® analysis of the human IGH locus: unveiling novel polymorphisms and copy number variations in 15 genome assemblies from diverse ancestral backgrounds.
IF 2.8 | Q1 GENETICS & HEREDITY | Pub Date: 2025-09-17 | eCollection Date: 2025-09-01 | DOI: 10.1093/nargab/lqaf127
Ariadni Papadaki, Maria Georga, Joumana Jabado-Michaloud, Géraldine Folch, Guilhem Zeitoun, Patrice Duroux, Véronique Giudicelli, Sofia Kossida

Unraveling the genetic complexity of the human immunoglobulin heavy (IGH) chain locus provides valuable insights into the mechanisms underlying the efficacy and specificity of the adaptive immune response. Despite its crucial role, the IGH locus remains insufficiently characterized, with its allelic diversity and polymorphisms inadequately investigated. In this study, we present an analysis of the human IGH locus, incorporating 15 human genome assemblies from diverse ancestries, including African, European, Asian, Saudi, and mixed backgrounds. Through our examination of both maternal and paternal assemblies, we uncover novel IGH alleles, copy number variations (CNV), and polymorphisms, particularly within the variable (IGHV) region. Our findings reveal extensive and previously uncharacterized genetic variability in the constant (IGHC) region and distinct IMGT CNV forms across individuals. This research contributes to a significant enrichment of the IMGT® IGH reference directory, databases, tools and web resources, and lays the groundwork for an IMGT® haplotype database which can be progressively enriched as additional datasets become available. Such a resource promises to propel personalized immunogenomics forward, with exciting applications in cancer immunotherapy, COVID-19, and other immune-related diseases.

NAR Genomics and Bioinformatics 7(3): lqaf127. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12448898/pdf/
Citations: 0
araCNA: somatic copy number profiling using long-range sequence models.
IF 2.8 | Q1 GENETICS & HEREDITY | Pub Date: 2025-09-09 | eCollection Date: 2025-09-01 | DOI: 10.1093/nargab/lqaf124
Ellen Visscher, Christopher Yau

Somatic copy number alterations (CNAs) are hallmarks of cancer. Current algorithms that call CNAs from whole-genome sequenced (WGS) data have not exploited deep learning methods owing to computational scaling limitations. Here, we present a novel deep-learning approach, araCNA, trained only on simulated data that can accurately predict CNAs in real WGS cancer genomes. araCNA uses novel transformer alternatives (e.g. Mamba) to handle genomic-scale sequence lengths (∼1M) and learn long-range interactions. Results are extremely accurate on simulated data, and this zero-shot approach is on par with existing methods when applied to 50 WGS samples from the Cancer Genome Atlas. Notably, our approach requires only a tumour sample and not a matched normal sample, has fewer markers of overfitting, and performs inference in only a few minutes. araCNA demonstrates how domain knowledge can be used to simulate training sets that harness the power of modern machine learning in biological applications.
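
To make the idea of training on simulated data concrete, the sketch below generates a labelled read-depth profile from a chosen copy number segmentation. It rests on stated assumptions that do not come from the abstract: the common simplification that expected depth scales with the purity-weighted mean copy number of tumour and normal cells, Gaussian noise, and hypothetical names and parameters (`base_depth`, `noise_sd`). It is not the simulator used by araCNA.

```python
import random


def expected_depth(total_cn, purity, base_depth=30.0, normal_cn=2):
    """Expected read depth at a locus for a tumour/normal cell mixture.

    base_depth is the assumed depth of a pure diploid sample.
    """
    mean_cn = purity * total_cn + (1.0 - purity) * normal_cn
    return base_depth * mean_cn / normal_cn


def simulate_profile(segments, purity, noise_sd=1.0, seed=0):
    """Simulate noisy depths for a list of (n_loci, total_cn) segments.

    Returns (depths, labels), where labels holds the true copy number per locus.
    """
    rng = random.Random(seed)  # seeded so the simulation is reproducible
    depths, labels = [], []
    for n_loci, total_cn in segments:
        mu = expected_depth(total_cn, purity)
        for _ in range(n_loci):
            depths.append(max(0.0, rng.gauss(mu, noise_sd)))  # depth cannot be negative
            labels.append(total_cn)
    return depths, labels


# Hypothetical profile: a diploid stretch followed by a four-copy gain.
depths, labels = simulate_profile([(3, 2), (2, 4)], purity=0.8)
```

Because the labels are known by construction, any number of such profiles can be generated as supervised training pairs without needing annotated tumour genomes.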

NAR Genomics and Bioinformatics 7(3): lqaf124. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12418177/pdf/
Citations: 0
Cell-free DNA as a potential alternative to genomic DNA in genetic studies.
IF 2.8 | Q1 GENETICS & HEREDITY | Pub Date: 2025-09-09 | eCollection Date: 2025-09-01 | DOI: 10.1093/nargab/lqaf119
Jingyu Zeng, Huanhuan Zhu, Yu Wang, Guodan Zeng, Panhong Liu, Rijing Ou, Xianmei Lan, Yuhui Zheng, Chenhui Zhao, Linxuan Li, Haiqiang Zhang, Jianhua Yin, Mingzhi Liao, Yan Zhang, Xin Jin

Next-generation sequencing has greatly advanced genomics, enabling large-scale studies of population genetics and complex traits. Genomic DNA (gDNA) from white blood cells has traditionally been the main data source, but cell-free DNA (cfDNA), found in bodily fluids as fragmented DNA, is increasingly recognized as a valuable biomarker in clinical and genetic studies. However, a direct comparison between cfDNA and gDNA has not been fully explored. In this study, we analyzed cfDNA and gDNA from 186 healthy individuals, using the same sequencing platform. We compared sequencing quality, variant detection, allele frequencies (AF), genotype concordance, population structure, and genomic association results (genome-wide association study and expression quantitative trait locus). While cfDNA showed higher duplication rates and lower effective sequencing depth, both DNA types displayed similar quality metrics at the same depth. We also observed that significant depth differences between cfDNA and gDNA were mainly found in centromeric regions. While gDNA identified more variants with more uniform coverage, AF spectra, population structure, and genomic associations were largely consistent between the two DNA types. This study provides a detailed comparison of cfDNA and gDNA, highlighting the potential of cfDNA as an alternative to gDNA in genomic research. Our findings could serve as a reference for future studies on cfDNA and gDNA.
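
Genotype concordance, one of the comparisons listed in the abstract above, is commonly computed as the fraction of jointly genotyped sites at which two call sets agree. The sketch below assumes a simple dictionary representation with hypothetical site keys; it illustrates the metric and is not the study's pipeline.

```python
def genotype_concordance(calls_a, calls_b):
    """Fraction of sites genotyped in both call sets where the genotypes agree.

    calls_a, calls_b: dicts mapping a site key to an unphased genotype given
    as a pair of allele indices, e.g. (0, 1); None marks a missing call.
    """
    shared = [
        site for site in calls_a
        if site in calls_b and calls_a[site] is not None and calls_b[site] is not None
    ]
    if not shared:
        raise ValueError("no sites genotyped in both call sets")
    # Sort alleles so that (0, 1) and (1, 0) count as the same unphased genotype.
    agree = sum(1 for site in shared if sorted(calls_a[site]) == sorted(calls_b[site]))
    return agree / len(shared)


# Hypothetical calls for one individual from cfDNA and gDNA.
cfdna = {"chr1:100": (0, 1), "chr1:200": (1, 1), "chr1:300": (0, 0), "chr1:400": None}
gdna = {"chr1:100": (1, 0), "chr1:200": (0, 1), "chr1:300": (0, 0), "chr1:400": (0, 1)}
concordance = genotype_concordance(cfdna, gdna)
```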

NAR Genomics and Bioinformatics 7(3): lqaf119. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12408905/pdf/
Citations: 0
Large-scale composite hypothesis testing procedure for omics data analyses.
IF 2.8 | Q1 GENETICS & HEREDITY | Pub Date: 2025-09-05 | eCollection Date: 2025-09-01 | DOI: 10.1093/nargab/lqaf118
Annaïg De Walsche, Franck Gauthier, Nathalie Boissot, Alain Charcosset, Tristan Mary-Huard

Composite hypothesis testing using summary statistics is a well-established approach for assessing the effect of a single marker or gene across multiple traits or omics levels. Numerous procedures have been developed for this task and have been successfully applied to identify complex patterns of association between traits, conditions, or phenotypes. However, existing methods often struggle with scalability in large datasets or fail to account for dependencies between traits or omics levels, limiting their ability to control false positives effectively. To overcome these challenges, we present the qch_copula approach, which integrates mixture models with a copula function to capture dependencies between traits or omics and provides rigorously defined P-values for any composite hypothesis. Through a comprehensive benchmark against eight state-of-the-art methods, we demonstrate that qch_copula controls Type I error rates effectively while enhancing the detection of joint association patterns. Compared to other mixture model-based approaches, our method notably reduces memory usage during the EM algorithm, allowing the analysis of up to 20 traits and 10^5-10^6 markers. The effectiveness of qch_copula is further validated through two application cases in human and plant genetics. The method is available in the R package qch, accessible on CRAN.
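
One way to picture a composite hypothesis over several traits is as a set of association configurations, for example "associated with at least q of Q traits". The sketch below enumerates configurations and sums hypothetical per-configuration posterior probabilities for one marker. It only illustrates the structure of the hypothesis and why the number of configurations (2^Q) becomes a scaling concern; it is an assumption-laden toy, not the qch_copula procedure, and the names and probabilities are made up.

```python
from itertools import product


def configurations(n_traits):
    """All 2**n_traits association patterns (0 = null, 1 = associated)."""
    return list(product((0, 1), repeat=n_traits))


def composite_posterior(config_posteriors, min_associated):
    """Posterior probability that at least `min_associated` traits are associated.

    config_posteriors: dict mapping each configuration tuple to its posterior
    probability for one marker (values are expected to sum to 1).
    """
    return sum(
        prob for config, prob in config_posteriors.items()
        if sum(config) >= min_associated
    )


# Hypothetical posteriors for one marker and two traits.
posteriors = {(0, 0): 0.5, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.1}
at_least_one = composite_posterior(posteriors, min_associated=1)
both = composite_posterior(posteriors, min_associated=2)

# The configuration count doubles with every added trait.
n_configs_20_traits = 2 ** 20  # 1,048,576 patterns
```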

NAR Genomics and Bioinformatics 7(3): lqaf118. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12412788/pdf/
Citations: 0
proRate: an R package to infer gene transcription rates with a novel least sum of squares method.
IF 2.8 | Q1 GENETICS & HEREDITY | Pub Date: 2025-09-05 | eCollection Date: 2025-09-01 | DOI: 10.1093/nargab/lqaf123
Yu Liu, Fadhl Alakwaa

The dynamics of transcriptional elongation influence many biological activities, such as RNA splicing, polyadenylation, and nuclear export. To quantify the elongation rate, a typical method is to treat cells with drugs that inhibit RNA polymerase II (Pol II) from entering the gene body and then track Pol II using Pro-seq or Gro-seq. However, the downstream data analysis is complicated by the need to identify the transition point between the gene region affected by the drug and the region that is not, which is necessary to calculate the transcription rate. Although the traditional hidden Markov model (HMM) can be used to solve this problem, the method is complicated by its hidden variable and the many parameters that must be estimated. Hence, we developed the R package proRate, which identifies the transition point with a novel least sum of squares (LSS) method and calculates the elongation rate accordingly. In addition, proRate also covers other functions frequently used in studies of transcription dynamics, including metagene plotting, pause index calculation, gene structure analysis, etc. The effectiveness of this package is demonstrated by its performance on three Pro-seq or Gro-seq datasets, showing higher accuracy than HMM. proRate is freely available at https://github.com/yuabrahamliu/proRate or https://github.com/FADHLyemen/proRate.
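
A least-sum-of-squares transition point can be read as a two-segment change-point problem: choose the split that minimises the total squared deviation of each segment from its own mean. The sketch below is one plausible generic reading, written in Python rather than R. The function names, the binned toy signal, and the bin size and treatment time are assumptions, and it should not be taken as proRate's implementation.

```python
def lss_transition(signal):
    """Index k (1 <= k < len(signal)) splitting `signal` into signal[:k] and
    signal[k:] with the smallest total squared deviation from each segment's mean."""
    n = len(signal)
    if n < 2:
        raise ValueError("need at least two bins")

    def sse(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    # Exhaustive search over all split points; fine for a per-gene binned signal.
    return min(range(1, n), key=lambda k: sse(signal[:k]) + sse(signal[k:]))


def elongation_rate(transition_bin, bin_size_bp, minutes):
    """Distance from the gene start to the transition point divided by time (bp/min)."""
    return transition_bin * bin_size_bp / minutes


# Hypothetical binned Pol II signal: depleted upstream bins, then untreated levels.
signal = [1, 1, 2, 1, 9, 10, 9, 11]
k = lss_transition(signal)
rate = elongation_rate(k, bin_size_bp=5000, minutes=10)
```

Unlike an HMM, this formulation has no hidden states or emission parameters to estimate; the only unknown is the split position.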

NAR Genomics and Bioinformatics 7(3): lqaf123. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12412782/pdf/
Citations: 0
A high-resolution meiotic crossover map from single-nucleus ATAC-seq reveals insights into the recombination landscape in mammals.
IF 2.8 | Q1 GENETICS & HEREDITY | Pub Date: 2025-09-03 | eCollection Date: 2025-09-01 | DOI: 10.1093/nargab/lqaf122
Stevan Novakovic, Caitlin Harris, Ruijie Liu, Davis J McCarthy, Wayne Crismani

Meiotic crossovers promote correct chromosome segregation and the shuffling of genetic diversity. However, the measurement of crossovers remains challenging, impeding our ability to decipher the molecular mechanisms that are necessary for their formation and regulation. Here we demonstrate a novel repurposing of the single-nucleus Assay for Transposase Accessible Chromatin with sequencing (snATAC-seq) as a simple and high-throughput method to identify and characterize meiotic crossovers from haploid testis nuclei. We first validate the feasibility of obtaining genome-wide coverage from snATAC-seq by using ATAC-seq on bulk haploid mouse testis nuclei, ensuring adequate variant detection for haplotyping. Subsequently, we adapt droplet-based snATAC-seq for crossover detection, revealing >25 000 crossovers in F1 hybrid mice. Comparison between the wild type and a hyper-recombinogenic Fancm-deficient mutant mouse model confirmed an increase in crossover rates in this genotype, although the distribution of crossovers was unchanged. We also find that regions with the highest rate of crossover formation are enriched for PRDM9. Our findings demonstrate the utility of snATAC-seq as a robust and scalable tool for high-throughput crossover detection, offering insights into meiotic crossover dynamics and elucidating the underlying molecular mechanisms. It is possible that the research presented here with snATAC-seq of haploid post-meiotic nuclei could be extended into fertility-related diagnostics.
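
In a haploid nucleus from an F1 hybrid, each informative variant matches one parental strain, so a crossover appears as a switch of parental origin along a chromosome. The sketch below illustrates that idea under stated assumptions: markers are already assigned to parent "A" or "B", isolated markers are treated as noise via a hypothetical `min_run` threshold, and the positions are made up. It is not the authors' calling pipeline.

```python
def call_crossovers(markers, min_run=2):
    """Locate haplotype switches along one chromosome of a haploid nucleus.

    markers: list of (position, parent) tuples sorted by position, where parent
    is "A" or "B" for the parental strain the observed allele matches.
    Runs shorter than `min_run` markers are treated as noise and dropped.
    Returns (left_position, right_position) intervals, each bounding one
    putative crossover.
    """
    # Collapse consecutive markers from the same parent into runs.
    runs = []  # each run: [parent, first_pos, last_pos, n_markers]
    for pos, parent in markers:
        if runs and runs[-1][0] == parent:
            runs[-1][2] = pos
            runs[-1][3] += 1
        else:
            runs.append([parent, pos, pos, 1])

    # Drop short runs, then merge neighbours that now share a parent.
    merged = []
    for run in (r for r in runs if r[3] >= min_run):
        if merged and merged[-1][0] == run[0]:
            merged[-1][2] = run[2]
            merged[-1][3] += run[3]
        else:
            merged.append(list(run))

    # A crossover lies between the last marker of one run and the first of the next.
    return [(left[2], right[1]) for left, right in zip(merged, merged[1:])]


# Hypothetical markers: the lone "B" at 400 is noise; the real switch is 600 -> 700.
markers = [(100, "A"), (200, "A"), (300, "A"), (400, "B"), (500, "A"),
           (600, "A"), (700, "B"), (800, "B"), (900, "B")]
crossovers = call_crossovers(markers)
```

The width of each returned interval reflects marker density, which is why adequate variant coverage per nucleus matters for resolution.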

NAR Genomics and Bioinformatics, volume 7, issue 3, article lqaf122. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12408907/pdf/
Citations: 0
GViNC: an innovative framework for genome graph comparison reveals hidden patterns in the genetic diversity of human populations.
IF 2.8 Q1 GENETICS & HEREDITY Pub Date : 2025-09-03 eCollection Date: 2025-09-01 DOI: 10.1093/nargab/lqaf121
Venkatesh Kamaraj, Ayam Gupta, Karthik Raman, Manikandan Narayanan, Himanshu Sinha

Genome graphs provide a powerful reference structure for representing genetic diversity. Their structure emphasizes the polymorphic regions in a collection of genomes, enabling network-based comparisons of population-level variation. However, current tools are limited in their ability to quantify and compare structural features across large genome graphs. We introduce GViNC, Genome graph Visualization, Navigation, and Comparison, a novel framework that enables partitioning genome graphs into interpretable subgraphs, mapping linear coordinates to graph nodes, and summarizing both local and global structural variation using new metrics for variability, hypervariability, and graph distances. We applied GViNC to multiple pan-genomic and population-specific genome graphs constructed with over 85M variants in 2504 individuals from the 1000 Genomes Project. We found that genomic complexity varied by ancestry and across chromosomes, with rare variants increasing variability by 10-fold and hypervariability by 50-fold. GViNC highlighted key regions of the human genome, such as Human Leukocyte Antigen and DEFB loci, and many previously unreported high-diversity regions, some with population-specific signatures in protein-coding and regulatory genes. By bridging sequence-level variation and graph-level topology, GViNC enables scalable, quantitative exploration of genome structure across populations. GViNC's versatility can aid researchers in extensively investigating the genetic diversity of different cohorts, populations, or species of interest.

NAR Genomics and Bioinformatics, volume 7, issue 3, article lqaf121. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12408910/pdf/
Citations: 0
Correction to 'Immunopipe: a comprehensive and flexible scRNA-seq and scTCR-seq data analysis pipeline'.
IF 2.8 Q1 GENETICS & HEREDITY Pub Date : 2025-09-03 eCollection Date: 2025-09-01 DOI: 10.1093/nargab/lqaf126

[This corrects the article DOI: 10.1093/nar/lqaf063.].

NAR Genomics and Bioinformatics, volume 7, issue 3, article lqaf126. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12408894/pdf/
Citations: 0