GigaScience latest articles, page 7

BVSim: A benchmarking variation simulator mimicking human variation spectrum.

IF 11.8 Zone 2 Biology Q1 MULTIDISCIPLINARY SCIENCES

GigaScience

Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf095

Yongyi Luo, Zhen Zhang, Shu Wang, Jiandong Shi, Jingyu Hao, Sheng Lian, Taobo Hu, Toyotaka Ishibashi, Depeng Wang, Weichuan Yu, Xiaodan Fan

Background: Genomic variations, including single-nucleotide polymorphisms, small insertions and deletions, and structural variations, are crucial for understanding evolution and disease. However, comprehensive simulation tools for benchmarking genomic analysis methods are lacking. Existing simulators do not accurately represent the nonuniform distribution and length patterns of structural variations in human genomes, and simulating complex structural variations remains challenging.

Results: We present BVSim, a flexible tool that provides probabilistic simulations of genomic variations, primarily focusing on human patterns while accommodating diverse species. BVSim simulates both simple and complex structural variations and small variants by mimicking real-life variation distributions, which often exhibit higher frequencies near telomeres and within tandem repeat regions. Notably, BVSim allows users to input single or multiple benchmark samples from any reference genome, enabling the tool to summarize and represent the distribution patterns of structural variation positions and lengths specific to that species. Its compatibility with standard file formats facilitates integration into various genomic research workflows, making it a useful resource for benchmarking downstream tools such as variant callers. With numerical experiments, we show that BVSim generates more realistic sequences that differ significantly from other simulators' outputs.

Conclusions: BVSim is written in Python and freely available to noncommercial users under the GPL3 license. Source code, application guide, and toy examples are provided on the GitHub page at https://github.com/YongyiLuo98/BVSim. The tool is registered in SciCrunch (RRID:SCR_026926), bio.tools (biotools:BVSim), and WorkflowHub (doi:10.48546/WORKFLOWHUB.WORKFLOW.1361.1).
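The idea of placing structural variations nonuniformly, with higher frequency near telomeres, can be illustrated with a small sketch. This is not BVSim's code or interface: the function name, the bin size, the telomere window, and the boost factor are all hypothetical values chosen only to show weighted position sampling.

```python
import random


def sample_sv_positions(chrom_length, n_variants, telomere_window=5_000_000,
                        telomere_boost=4.0, bin_size=500_000, seed=0):
    """Draw variant start positions from a binned, nonuniform distribution.

    All parameters are illustrative assumptions, not BVSim defaults.
    """
    rng = random.Random(seed)
    n_bins = (chrom_length + bin_size - 1) // bin_size

    # Weight each bin by its width, boosted if it overlaps a chromosome end.
    weights = []
    for b in range(n_bins):
        start = b * bin_size
        end = min(start + bin_size, chrom_length)
        near_end = start < telomere_window or end > chrom_length - telomere_window
        weights.append((end - start) * (telomere_boost if near_end else 1.0))

    # Pick a bin by weight, then a uniform position inside that bin.
    chosen_bins = rng.choices(range(n_bins), weights=weights, k=n_variants)
    positions = []
    for b in chosen_bins:
        start = b * bin_size
        end = min(start + bin_size, chrom_length)
        positions.append(rng.randrange(start, end))
    return sorted(positions)


if __name__ == "__main__":
    pos = sample_sv_positions(100_000_000, 2000)
    near = sum(p < 5_000_000 or p >= 95_000_000 for p in pos)
    print(f"{near / len(pos):.2f} of positions fall in the outer 10% of the chromosome")
```

A real simulator would estimate the per-bin weights from benchmark samples rather than hard-code a boost, which is what the abstract describes for user-supplied samples.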

Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12398280/pdf/

Citations: 0

Improving taxonomic inference from ancient environmental metagenomes by masking microbial-like regions in reference genomes.

IF 11.8 Zone 2 Biology Q1 MULTIDISCIPLINARY SCIENCES

GigaScience

Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf108

Nikolay Oskolkov, Chenyu Jin, Samantha López Clinton, Benjamin Guinet, Flore Wijnands, Ernst Johnson, Verena E Kutschera, Cormac M Kinsella, Peter D Heintzman, Tom van der Valk

Ancient environmental DNA is increasingly vital for reconstructing past ecosystems, particularly when paleontological and archaeological tissue remains are absent. Detecting ancient plant and animal DNA in environmental samples relies on using extensive eukaryotic reference genome databases for profiling metagenomics data. However, many eukaryotic genomes contain regions with high sequence similarity to microbial DNA, which can lead to the misclassification of bacterial and archaeal reads as eukaryotic. This issue is especially problematic in ancient eDNA datasets, where plant and animal DNA is typically present at very low abundance. In this study, we present a method for identifying bacterial- and archaeal-like sequences in eukaryotic genomes and apply it to nearly 3,000 reference genomes from NCBI RefSeq and GenBank (vertebrates, invertebrates, plants) as well as the 1,323 PhyloNorway plant genome assemblies from herbarium material from northern high-latitude regions. We find that microbial-like regions are widespread across eukaryotic genomes and provide a comprehensive resource of their genomic coordinates and taxonomic annotations. This resource enables the masking of microbial-like regions during profiling analyses, thereby improving the reliability of ancient environmental metagenomic datasets for downstream analyses.
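Masking by genomic coordinates can be sketched as follows. This is a minimal illustration, assuming half-open, 0-based (BED-style) intervals and hard masking with `N`; the function name and the interval convention are assumptions, not the format of the published resource.

```python
def mask_regions(sequence, regions, mask_char="N"):
    """Return a copy of `sequence` with every [start, end) interval masked.

    `regions` is an iterable of (start, end) pairs, 0-based and half-open
    (an assumed convention). Intervals are clipped to the sequence bounds.
    """
    masked = list(sequence)
    for start, end in regions:
        start = max(0, start)
        end = min(len(masked), end)
        for i in range(start, end):
            masked[i] = mask_char
    return "".join(masked)


if __name__ == "__main__":
    # Two hypothetical microbial-like regions on a toy contig.
    print(mask_regions("ACGTACGTAC", [(2, 5), (8, 20)]))
```

Reads that would otherwise align to the masked stretches no longer have a eukaryotic target there, which is the effect the abstract attributes to masking during profiling.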

Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12491943/pdf/

Citations: 0

MRanalysis: a comprehensive online platform for integrated, multimethod Mendelian randomization and associated post-GWAS analyses.

IF 11.8 Zone 2 Biology Q1 MULTIDISCIPLINARY SCIENCES

GigaScience

Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf131

Abao Xing, Tiantian Cai, Haofan Du, Zhifan Li, Hoiman Ng, Junrong Li, Guanmin Jiang, Lijun Chen, Kefeng Li

Background: Mendelian randomization (MR) is a powerful epidemiological method for inferring causal relationships between exposures and outcomes using genome-wide association study (GWAS) data. However, its adoption is limited by inconsistent data formats, lack of standardized workflows, and the need for programming expertise. To address these challenges, we developed MRanalysis, a user-friendly, web-based platform for integrated MR analysis, and GWASkit, a standalone tool for GWAS data preprocessing.

Results: MRanalysis provides a comprehensive, no-code workflow for MR analysis, including data quality assessment, power estimation, single-nucleotide polymorphism to gene enrichment, and visualization. It supports univariable, multivariable, and mediation MR analyses through an intuitive interface. GWASkit facilitates rapid GWAS data preprocessing, such as rs ID conversion and format standardization, with significantly higher accuracy and efficiency than existing tools. Case studies demonstrate the utility and efficiency of both tools in real-world scenarios.

Conclusions: MRanalysis and GWASkit lower barriers to MR analysis, making it more accessible, reliable, and efficient. By democratizing MR, these tools can accelerate discoveries in genetic epidemiology, inform public health strategies, and guide targeted interventions. MRanalysis is freely available at https://mranalysis.cn, and GWASkit can be accessed at https://github.com/Li-OmicsLab-MPU/GWASkit. Together, they represent a significant advance in understanding the complex relationships between genes, exposures, and health outcomes.
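To make concrete what a univariable MR estimate from GWAS summary statistics looks like, here is a sketch of the standard fixed-effect inverse-variance weighted (IVW) estimator. The abstract does not say which estimators MRanalysis runs, so this is a textbook formula shown under that assumption, not the platform's implementation; the function and argument names are hypothetical.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW causal estimate from per-SNP summary statistics.

    estimate = sum(w * bx * by) / sum(w * bx^2), with w = 1 / se_outcome^2
    standard error = sqrt(1 / sum(w * bx^2))
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("all inputs must have the same length")
    if not beta_exposure:
        raise ValueError("at least one instrument is required")

    weights = [1.0 / se ** 2 for se in se_outcome]
    numerator = sum(w * bx * by
                    for w, bx, by in zip(weights, beta_exposure, beta_outcome))
    denominator = sum(w * bx * bx for w, bx in zip(weights, beta_exposure))
    if denominator == 0:
        raise ValueError("exposure effects are all zero")
    return numerator / denominator, math.sqrt(1.0 / denominator)


if __name__ == "__main__":
    # Toy instruments where the outcome effect is exactly half the exposure effect.
    est, se = ivw_estimate([0.1, 0.2, 0.3], [0.05, 0.10, 0.15], [0.01, 0.02, 0.01])
    print(est, se)
```

The per-SNP inputs are the columns that a preprocessing step such as format standardization has to align across the exposure and outcome GWAS files.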

Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12616851/pdf/

Citations: 0

Improving the reliability, quality, and maintainability of bioinformatics pipelines with nf-test.

IF 11.8 Zone 2 Biology Q1 MULTIDISCIPLINARY SCIENCES

GigaScience

Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf130

Lukas Forer, Sebastian Schönherr

Background: The workflow management system Nextflow, together with the nf-core community, has established an essential ecosystem in bioinformatics. However, ensuring the correctness and reliability of large and complex Nextflow pipelines remains challenging due to the lack of a unified, automated unit-testing framework.

Results: To address this gap, we present nf-test, a modular testing framework for bioinformatics workflows. It enables users to test process blocks, workflow patterns, and entire pipelines in isolation while validating their outputs. Built with a syntax similar to Nextflow DSL2, nf-test offers unique features such as snapshot testing and smart testing, which optimize resource usage by testing only modified modules. We demonstrate across multiple pipelines that these features minimize development time, reduce test execution time by up to 80%, and enhance software quality by identifying bugs and issues early in the development process.

Conclusions: Already adopted by numerous pipelines, nf-test significantly improves the robustness, maintainability, and reliability of bioinformatics pipelines.
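The two features named in the abstract, snapshot testing and testing only modified modules, can be illustrated conceptually. The sketch below is plain Python and does not use or reproduce nf-test's syntax or API; the function names, the snapshot file layout, and the dependency map are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def check_snapshot(name, output, snapshot_dir):
    """Compare `output` with a stored snapshot, creating it on first run.

    Returns "created", "match", or "mismatch".
    """
    path = Path(snapshot_dir) / f"{name}.snap.json"
    serialized = json.dumps(output, sort_keys=True, indent=2)
    if not path.exists():
        path.write_text(serialized)
        return "created"
    return "match" if path.read_text() == serialized else "mismatch"


def select_tests(changed_files, dependencies):
    """Return the tests whose declared source files intersect the changed files."""
    changed = set(changed_files)
    return sorted(test for test, files in dependencies.items()
                  if changed & set(files))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(check_snapshot("align", {"reads": 100}, tmp))  # first run stores the snapshot
        print(check_snapshot("align", {"reads": 100}, tmp))  # unchanged output
        print(check_snapshot("align", {"reads": 99}, tmp))   # changed output
    deps = {"test_align": ["align.nf"], "test_call": ["call.nf", "utils.nf"]}
    print(select_tests(["utils.nf"], deps))
```

Storing the expected output once and diffing against it removes hand-written assertions for every output value, and selecting tests by changed files is one simple way to skip unaffected modules.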

Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12616847/pdf/

Citations: 0

SurGen: 1020 H&E-stained whole-slide images with survival and genetic markers.

IF 11.8 Zone 2 Biology Q1 MULTIDISCIPLINARY SCIENCES

GigaScience

Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf086

Craig Myles, In Hwa Um, Craig Marshall, David Harris-Birtill, David J Harrison

Background: Cancer remains one of the leading causes of morbidity and mortality worldwide. Comprehensive datasets that combine histopathological images with genetic and survival data across various tumour sites are essential for advancing computational pathology and personalised medicine.

Results: We present SurGen, a dataset comprising 1,020 H&E-stained whole-slide images (WSIs) from 843 colorectal cancer cases. The dataset includes detailed annotations for key genetic mutations (KRAS, NRAS, BRAF) and mismatch repair status, as well as survival data for 426 cases. We illustrate SurGen's utility with a proof-of-concept model that predicts mismatch repair status directly from WSIs, achieving a test area under the receiver operating characteristic curve of 0.8273. These preliminary results underscore the dataset's potential to facilitate research in biomarker discovery, prognostic modelling, and advanced machine learning applications in colorectal cancer and beyond.

Conclusions: SurGen offers a valuable resource for the scientific community, enabling studies that require high-quality WSIs linked with comprehensive clinical and genetic information on colorectal cancer. Our initial findings affirm the dataset's capacity to advance diagnostic precision and foster the development of personalised treatment strategies in colorectal oncology. Data available online: https://doi.org/10.6019/S-BIAD1285.
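The reported metric, area under the receiver operating characteristic curve (AUROC), equals the probability that a randomly chosen positive case is scored above a randomly chosen negative one. The sketch below computes it by direct pairwise comparison; the labels and scores are toy values, not SurGen data, and the function name is hypothetical.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct ranking. `labels` holds 0/1 values.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy example: 1 = mismatch repair deficient, 0 = proficient (illustrative labels).
    print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise form is quadratic in the number of cases, which is acceptable for a few hundred slides; rank-based implementations give the same value faster.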

Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12569769/pdf/

Citations: 0

CODARFE: Unlocking the prediction of continuous environmental variables based on microbiome.

IF 11.8 Zone 2 Biology Q1 MULTIDISCIPLINARY SCIENCES

GigaScience

Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf055

Murilo Caminotto Barbosa, João Fernando Marques da Silva, Leonardo Cardoso Alves, Robert D Finn, Alexandre Rossi Paschoal

Background: Despite the surge in microbiome data acquisition, few tools can effectively analyze these data and identify correlations between taxonomic compositions and continuous environmental factors. Furthermore, existing tools do not predict environmental factors in new samples, underscoring the pressing need for innovative solutions to enhance our understanding of microbiome dynamics and fill the prediction gap. Here we introduce CODARFE, a novel tool for sparse compositional microbiome predictor selection and prediction of continuous environmental factors.

Results: We tested CODARFE against 4 state-of-the-art tools in 2 experiments. First, CODARFE outperformed the other tools at predictor selection in 21 of 24 databases in terms of correlation. Second, among all the tools, CODARFE recovered the highest number of previously identified bacteria linked to environmental factors for human data, at least 7% more than the others. We also tested CODARFE in a cross-study setting, using the same biome under different external effects: a model trained on 1 dataset predicted environmental factors on another dataset with a mean absolute percentage error of 11%. Finally, CODARFE is available in 5 formats, ranging from a Windows version with a graphical interface to installable source code for Linux servers and an embedded Jupyter notebook available at MGnify.

Conclusions: Our findings underscore the robustness and broad applicability of CODARFE across diverse fields, even under varying experimental conditions. Additionally, the ability to predict outcomes in new samples allows for the generation of new insights in previously unexplored contexts, providing researchers with a versatile tool.
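The cross-study result is reported as a mean absolute percentage error (MAPE). As a reminder of what that number measures, here is a minimal sketch of the usual definition; the function name and the toy values are illustrative and not taken from CODARFE.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    MAPE = 100 / n * sum(|actual_i - predicted_i| / |actual_i|)
    Undefined when any actual value is zero, so that case raises an error.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


if __name__ == "__main__":
    # Toy environmental measurements and model predictions.
    print(mape([100.0, 200.0], [110.0, 180.0]))
```

Because each error is divided by the observed value, MAPE is scale-free, which makes it convenient for comparing predictions of environmental variables measured in different units.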

Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12365963/pdf/

Citations: 0

SynProtX: a large-scale proteomics-based deep learning model for predicting synergistic anticancer drug combinations.

IF 11.8 Zone 2 Biology Q1 MULTIDISCIPLINARY SCIENCES

GigaScience

Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf080

Bundit Boonyarit, Matin Kositchutima, Tisorn Na Phattalung, Nattawin Yamprasert, Chanitra Thuwajit, Thanyada Rungrotmongkol, Sarana Nutanong

Motivation: Drug combination therapy plays a pivotal role in addressing the molecular heterogeneity of cancer, improving treatment efficacy, minimizing resistance, and reducing toxicity. Deep learning approaches have significantly advanced drug combination discovery by addressing the limitations of conventional laboratory experiments, which are time-consuming and costly. While most existing models rely on the molecular structure of drugs and gene expression data, incorporating protein-level expression provides a more accurate representation of cellular behavior and drug responses. In this study, we introduce SynProtX, an enhanced deep learning model that explicitly integrates large-scale proteomics with deep neural networks (DNNs) and the molecular structure of drugs with graph neural networks (GNNs).

Results: The SynProtX-GATFP model, which combines molecular graphs and fingerprints through a graph attention network architecture, demonstrated superior predictive performance for the FRIEDMAN study dataset. We further evaluated its cell line-specific performance, which achieved accuracy across diverse tissue and study datasets. By incorporating protein expression data, the model consistently enhanced predictive performance over gene expression-only models, reflecting the functional state of cancer cells. The generalizability of SynProtX was rigorously validated using cold-start prediction, including leave-drug-combination-out, leave-drug-out, and leave-cell-line-out validation strategies, highlighting its robust performance and potential for clinical applicability. Additionally, SynProtX identified key cancer-associated proteins and molecular substructures, offering novel insights into the biological mechanisms underlying drug synergy. These findings highlight the potential of integrating large-scale proteomics and multiomics data to advance anticancer drug design and combination therapy strategies for personalized medicine. Availability and implementation: https://github.com/manbaritone/SynProtX.
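The cold-start validation strategies named above differ in what is withheld from training. The sketch below shows one common reading of leave-drug-out and leave-cell-line-out splits; the abstract does not spell out the exact protocol, so the record layout `(drug_a, drug_b, cell_line)` and both function names are assumptions for illustration only.

```python
def leave_drug_out_split(records, held_out_drugs):
    """Put every combination that contains a held-out drug into the test set."""
    held = set(held_out_drugs)
    train = [r for r in records if r[0] not in held and r[1] not in held]
    test = [r for r in records if r[0] in held or r[1] in held]
    return train, test


def leave_cell_line_out_split(records, held_out_cell_lines):
    """Put every record measured on a held-out cell line into the test set."""
    held = set(held_out_cell_lines)
    train = [r for r in records if r[2] not in held]
    test = [r for r in records if r[2] in held]
    return train, test


if __name__ == "__main__":
    # Toy records: (drug_a, drug_b, cell_line); names are placeholders.
    data = [("A", "B", "c1"), ("A", "C", "c1"), ("B", "C", "c2"), ("C", "D", "c2")]
    print(leave_drug_out_split(data, ["D"]))
    print(leave_cell_line_out_split(data, ["c2"]))
```

In either split the model never sees the held-out entity during training, so test performance reflects generalization to unseen drugs or cell lines rather than memorized pairs.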

动机：药物联合治疗在解决肿瘤分子异质性、提高治疗疗效、减少耐药、降低毒性等方面发挥着关键作用。深度学习方法通过解决传统实验室实验耗时且昂贵的局限性，显著推进了药物组合的发现。虽然大多数现有模型依赖于药物的分子结构和基因表达数据，但结合蛋白质水平的表达可以更准确地表示细胞行为和药物反应。在这项研究中，我们介绍了SynProtX，这是一个增强的深度学习模型，它明确地将大规模蛋白质组学与深度神经网络（dnn）和药物分子结构与图神经网络（gnn）相结合。结果：SynProtX-GATFP模型通过图注意网络架构将分子图和指纹结合起来，对FRIEDMAN研究数据集显示出卓越的预测性能。我们进一步评估了其细胞系特异性性能，该性能在不同组织和研究数据集中实现了准确性。通过结合蛋白表达数据，该模型比仅基因表达模型的预测性能持续提高，反映了癌细胞的功能状态。通过冷启动预测，包括遗漏药物组合、遗漏药物和遗漏细胞系验证策略，对SynProtX的通用性进行了严格验证，突出了其稳健的性能和临床应用潜力。此外，SynProtX还鉴定了关键的癌症相关蛋白和分子亚结构，为药物协同作用的生物学机制提供了新的见解。这些发现突出了整合大规模蛋白质组学和多组学数据在推进抗癌药物设计和个性化药物联合治疗策略方面的潜力。可用性和实现：https://github.com/manbaritone/SynProtX。

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12343095/pdf/

Citations: 0

A holistic genome dataset of bacteria and archaea of mangrove sediments.

IF 11.8 Zone 2 (Biology) Q1 MULTIDISCIPLINARY SCIENCES

GigaScience

Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf081

Shijun Pan, Huan Du, Ruiqi Zheng, Cuijing Zhang, Jie Pan, Xilan Yang, Cheng Wang, Xiaolan Lin, Jinhui Li, Wan Liu, Haokui Zhou, Xiaoli Yu, Shuming Mo, Guoqing Zhang, Guoping Zhao, Zhili He, Yun Tian, Chengjian Jiang, Wu Qu, Yang Liu, Meng Li

Background: Mangroves are one of the most productive marine ecosystems with high ecosystem service value. The sediment microbial communities contribute to pivotal ecological functions in mangrove ecosystems. However, the study of mangrove sediment microbiomes is limited.

Findings: Here, we applied metagenome sequencing analysis of microbial communities in mangrove sediments across Southeast China from 2014 to 2020. This genome dataset includes 966 metagenome-assembled genomes with ≥50% completeness and ≤10% contamination generated from 6 groups of samples. Phylogenomic analysis and taxonomy classification show that mangrove sediments are inhabited by microbial communities with high species diversity. Thermoplasmatota, Thermoproteota, and Asgardarchaeota in archaea, as well as Proteobacteria, Desulfobacterota, Chloroflexota, Acidobacteriota, and Gemmatimonadota in bacteria, dominate the mangrove sediments across Southeast China. Functional analyses suggest that the microbial communities may contribute to carbon, nitrogen, and sulfur cycling in mangrove sediments.

Conclusions: These combined microbial genomes provide an important complement of global mangrove genome datasets and may serve as a foundational resource for enhancing our understanding of the composition and functions of mangrove sediment microbiomes.
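The inclusion criteria stated in the Findings (≥50% completeness and ≤10% contamination) amount to a simple filter over per-genome quality estimates. The sketch below assumes those estimates are already available as percentages; the `Mag` record, the bin names, and the example values are invented for illustration and are not from the dataset.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Mag:
    """Quality summary for one metagenome-assembled genome (hypothetical record)."""
    name: str
    completeness: float   # percent
    contamination: float  # percent


def passes_threshold(
    mag: Mag,
    min_completeness: float = 50.0,
    max_contamination: float = 10.0,
) -> bool:
    """Apply the inclusive thresholds quoted in the abstract."""
    return (
        mag.completeness >= min_completeness
        and mag.contamination <= max_contamination
    )


# Made-up bins covering both boundary cases and both failure modes.
bins: List[Mag] = [
    Mag("bin_001", 92.4, 1.8),   # passes comfortably
    Mag("bin_002", 50.0, 10.0),  # exactly on both thresholds: kept
    Mag("bin_003", 48.9, 2.0),   # too incomplete
    Mag("bin_004", 75.0, 12.5),  # too contaminated
]

kept = [mag.name for mag in bins if passes_threshold(mag)]
```

Both comparisons are inclusive because the abstract states the thresholds with ≥ and ≤.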


Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12343073/pdf/

Citations: 0

TinkerHap-a novel read-based phasing algorithm with integrated multimethod support for enhanced accuracy.

IF 11.8 Zone 2 (Biology) Q1 MULTIDISCIPLINARY SCIENCES

GigaScience

Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf138

Uri Hartmann, Eran Shaham, Dafna Nathan, Ilana Blech, Danny Zeevi

Background: Phasing, the assignment of alleles to their respective parental chromosomes, is fundamental to studying genetic variation and identifying disease-causing variants. Traditional approaches, including statistical, pedigree-based, and read-based phasing, face challenges such as limited accuracy for rare variants and reliance on external reference panels.

Results: To address these limitations, we developed TinkerHap, a novel phasing algorithm that integrates a read-based phaser, based on a pairwise distance-based unsupervised classification, with external phased data, such as statistical or pedigree phasing. We evaluated TinkerHap's performance against other phasing algorithms using 1,040 parent-offspring trios from the UK Biobank (Illumina short reads) and GIAB Ashkenazi trio (PacBio long reads). TinkerHap's read-based phaser alone achieved higher phasing accuracies than all other algorithms with 95.1% for short reads (second best: 94.8%) and 97.5% for long reads (second best: 95.5%). Its hybrid approach further enhanced short-read performance to 96.3% accuracy and was able to phase 99.5% of all heterozygous sites. TinkerHap also extended haplotype block sizes to a median of 79,449 bp for long reads (second best: 68,303 bp) and demonstrated higher accuracy for both single-nucleotide polymorphisms and indels.

Conclusions: The combination of a robust read-based algorithm and a hybrid integration strategy makes TinkerHap a powerful and versatile tool for genomic analysis, enabling more accurate, contiguous, and comprehensive phasing across diverse sequencing platforms and variant types.
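To make the idea of read-based phasing by pairwise distance more concrete, here is a toy sketch. It is not the TinkerHap algorithm, whose details the abstract does not give: the read representation (a mapping from heterozygous site position to observed allele), the mismatch-fraction distance, and the greedy voting scheme are all assumptions chosen for brevity.

```python
from typing import Dict, List, Optional

# A read is reduced to the alleles it shows at heterozygous sites:
# position -> 0 (reference allele) or 1 (alternate allele). Hypothetical layout.
Read = Dict[int, int]


def distance(a: Read, b: Read) -> Optional[float]:
    """Fraction of shared sites where two reads disagree; None if no overlap."""
    shared = a.keys() & b.keys()
    if not shared:
        return None
    mismatches = sum(a[site] != b[site] for site in shared)
    return mismatches / len(shared)


def phase_reads(reads: List[Read]) -> List[int]:
    """Greedy two-way split: label each read 0 or 1, or -1 if it cannot be placed."""
    labels = [-1] * len(reads)
    if not reads:
        return labels
    labels[0] = 0  # seed haplotype 0 with the first read
    changed = True
    while changed:
        changed = False
        for i, read in enumerate(reads):
            if labels[i] != -1:
                continue
            votes = [0.0, 0.0]
            overlaps_labelled_read = False
            for j, other in enumerate(reads):
                if labels[j] == -1:
                    continue
                d = distance(read, other)
                if d is None:
                    continue
                overlaps_labelled_read = True
                # Similar reads vote for the same label, dissimilar for the other.
                votes[labels[j]] += 1.0 - d
                votes[1 - labels[j]] += d
            if overlaps_labelled_read:
                labels[i] = 0 if votes[0] >= votes[1] else 1
                changed = True
    return labels


reads = [
    {100: 0, 200: 1},
    {200: 1, 300: 0},
    {100: 1, 200: 0},
    {200: 0, 300: 1},
    {900: 1},  # overlaps no other read, so it stays unphased
]
labels = phase_reads(reads)
```

A read that shares no heterozygous site with any labelled read cannot be assigned from read evidence alone. In this toy model it stays at -1, which is the kind of gap the abstract says external statistical or pedigree phasing helps fill.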


Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12723663/pdf/

Citations: 0

Ultra-deep long-read metagenomics captures diverse taxonomic and biosynthetic potential of soil microbes.

IF 11.8 Zone 2 (Biology) Q1 MULTIDISCIPLINARY SCIENCES

GigaScience

Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf135

Caner Bağcı, Timo Negri, Elena Buena-Atienza, Caspar Gross, Stephan Ossowski, Nadine Ziemert

Background: Soil ecosystems have long been recognized as hotspots of microbial diversity, but most estimates of their microbial and functional complexity remain speculative despite decades of study, in part because conventional sequencing campaigns lack the depth and contiguity required to recover low-abundance and repetitive genomes. Here, we revisit this question using one of the deepest metagenomic sequencing efforts to date, applying 148 billion basepairs of Nanopore long-read data and 122 billion basepairs of Illumina short-read data to a single forest soil sample.

Results: Our hybrid assembly reconstructed 837 metagenome-assembled genomes, including 466 that meet high- and medium-quality standards, nearly all lacking close relatives among cultivated taxa. Rarefaction and k-mer analyses reveal that, even at this depth, we capture only a fraction of the extant diversity: nonparametric models project that more than 10 trillion basepairs of sequencing data would be required to approach saturation. These findings offer a quantitative, technology-enabled update to long-standing diversity estimates and demonstrate that conventional metagenomic sequencing efforts likely miss most microbial and biosynthetic potential in soil. We further identify more than 11,000 biosynthetic gene clusters, over 99% of which have no match in current databases, underscoring the breadth of unexplored metabolic capacity.

Conclusions: Taken together, our results emphasize both the power and the present limitations of metagenomics in resolving natural microbial complexity, and they provide a new baseline for evaluating future advances in microbial genome recovery, taxonomic classification, and natural product discovery.
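The scale of the figures quoted in the abstract is easier to judge with a little arithmetic. The sketch below uses only numbers stated above. Comparing the projected requirement against the combined Nanopore plus Illumina yield is an assumption made here for illustration, since the abstract does not say which data the projection is based on.

```python
GBP = 10**9  # one billion basepairs

nanopore_bp = 148 * GBP       # long-read yield stated in the abstract
illumina_bp = 122 * GBP       # short-read yield stated in the abstract
projected_bp = 10 * 10**12    # "more than 10 trillion basepairs" (a lower bound)

# Assumption: compare the projection with the combined yield of both platforms.
combined_bp = nanopore_bp + illumina_bp
fold_more = projected_bp / combined_bp

# Share of the 837 reconstructed genomes that meet high- or medium-quality standards.
quality_fraction = 466 / 837

print(f"combined yield: {combined_bp / GBP:.0f} Gbp")
print(f"projected need: at least {fold_more:.0f}x the combined yield")
print(f"high/medium-quality MAGs: {quality_fraction:.1%}")
```

Under that assumption, approaching saturation would take at least roughly 37 times the data generated for this sample, and a little over half of the reconstructed genomes meet the stated quality standards.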


Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12690461/pdf/

Citations: 0