Background: Genomic variations, including single-nucleotide polymorphisms, small insertions and deletions, and structural variations, are crucial for understanding evolution and disease. However, comprehensive simulation tools for benchmarking genomic analysis methods are lacking. Existing simulators do not accurately represent the nonuniform distribution and length patterns of structural variations in human genomes, and simulating complex structural variations remains challenging.
Results: We present BVSim, a flexible tool that provides probabilistic simulations of genomic variations, primarily focusing on human patterns while accommodating diverse species. BVSim simulates both simple and complex structural variations, as well as small variants, by mimicking real-life variation distributions, which often exhibit higher frequencies near telomeres and within tandem repeat regions. Notably, BVSim allows users to input single or multiple benchmark samples from any reference genome, enabling the tool to summarize and reproduce the distribution patterns of structural variation positions and lengths specific to that species. Its compatibility with standard file formats facilitates integration into genomic research workflows, making it a useful resource for benchmarking downstream tools such as variant callers. In numerical experiments, we show that BVSim generates more realistic sequences that differ significantly from the outputs of other simulators.
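As a rough illustration of the nonuniform placement described above, the following Python sketch draws breakpoint positions with extra weight on the bins at either chromosome end. It is not BVSim's code or interface: the function name, the bin layout, and the boost factor are all assumptions chosen for the example.

```python
import random


def sample_sv_positions(chrom_len, n, end_window, end_boost, seed=0):
    """Draw n breakpoint positions on a chromosome of length chrom_len.

    Hypothetical model: the chromosome is cut into bins of size end_window,
    and the first and last bins (standing in for telomere-proximal regions)
    receive end_boost times the weight of every other bin.
    """
    rng = random.Random(seed)
    n_bins = (chrom_len + end_window - 1) // end_window
    weights = [end_boost if i in (0, n_bins - 1) else 1.0 for i in range(n_bins)]
    chosen_bins = rng.choices(range(n_bins), weights=weights, k=n)
    positions = []
    for b in chosen_bins:
        start = b * end_window
        stop = min(start + end_window, chrom_len)
        positions.append(rng.randrange(start, stop))  # uniform within the bin
    return sorted(positions)


positions = sample_sv_positions(
    chrom_len=1_000_000, n=2000, end_window=50_000, end_boost=10.0
)
# Count breakpoints that fall in the two boosted end bins.
near_ends = sum(p < 50_000 or p >= 950_000 for p in positions)
print(f"{near_ends} of {len(positions)} breakpoints lie in the end bins")
```

Under a uniform model only about 10% of breakpoints would land in the two end bins; with the assumed boost, roughly half do. A benchmark-driven simulator would estimate such weights from real samples rather than fix them by hand.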
Conclusions: BVSim is written in Python and freely available to noncommercial users under the GPL3 license. Source code, application guide, and toy examples are provided on the GitHub page at https://github.com/YongyiLuo98/BVSim. The tool is registered in SciCrunch (RRID:SCR_026926), bio.tools (biotools:BVSim), and WorkflowHub (doi:10.48546/WORKFLOWHUB.WORKFLOW.1361.1).
BVSim: A benchmarking variation simulator mimicking human variation spectrum. Yongyi Luo, Zhen Zhang, Shu Wang, Jiandong Shi, Jingyu Hao, Sheng Lian, Taobo Hu, Toyotaka Ishibashi, Depeng Wang, Weichuan Yu, Xiaodan Fan. GigaScience, vol. 14, 2025-01-06. DOI: 10.1093/gigascience/giaf095. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12398280/pdf/
Pub Date: 2025-01-06. DOI: 10.1093/gigascience/giaf108
Nikolay Oskolkov, Chenyu Jin, Samantha López Clinton, Benjamin Guinet, Flore Wijnands, Ernst Johnson, Verena E Kutschera, Cormac M Kinsella, Peter D Heintzman, Tom van der Valk
Ancient environmental DNA is increasingly vital for reconstructing past ecosystems, particularly when paleontological and archaeological tissue remains are absent. Detecting ancient plant and animal DNA in environmental samples relies on using extensive eukaryotic reference genome databases for profiling metagenomics data. However, many eukaryotic genomes contain regions with high sequence similarity to microbial DNA, which can lead to the misclassification of bacterial and archaeal reads as eukaryotic. This issue is especially problematic in ancient eDNA datasets, where plant and animal DNA is typically present at very low abundance. In this study, we present a method for identifying bacterial- and archaeal-like sequences in eukaryotic genomes and apply it to nearly 3,000 reference genomes from NCBI RefSeq and GenBank (vertebrates, invertebrates, plants) as well as the 1,323 PhyloNorway plant genome assemblies from herbarium material from northern high-latitude regions. We find that microbial-like regions are widespread across eukaryotic genomes and provide a comprehensive resource of their genomic coordinates and taxonomic annotations. This resource enables the masking of microbial-like regions during profiling analyses, thereby improving the reliability of ancient environmental metagenomic datasets for downstream analyses.
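The masking step mentioned above can be pictured with a minimal Python sketch. It assumes the microbial-like regions are given as 0-based, half-open (start, end) intervals, as in BED files; the function name and the toy sequence are hypothetical and do not come from the published resource.

```python
def mask_regions(seq, regions, mask_char="N"):
    """Return seq with every (start, end) interval replaced by mask_char.

    Intervals are assumed to be 0-based and half-open; coordinates that
    run past the sequence ends are clipped.
    """
    chars = list(seq)
    for start, end in regions:
        start = max(start, 0)
        end = min(end, len(chars))
        for i in range(start, end):
            chars[i] = mask_char
    return "".join(chars)


# Toy reference with two hypothetical microbial-like intervals.
masked = mask_regions("ACGTACGTACGT", [(2, 5), (8, 10)])
print(masked)  # ACNNNCGTNNGT
```

Reads that would have aligned to the masked intervals no longer find a match there, which is the effect that masking is meant to have during profiling.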
Improving taxonomic inference from ancient environmental metagenomes by masking microbial-like regions in reference genomes. Nikolay Oskolkov, Chenyu Jin, Samantha López Clinton, Benjamin Guinet, Flore Wijnands, Ernst Johnson, Verena E Kutschera, Cormac M Kinsella, Peter D Heintzman, Tom van der Valk. GigaScience, vol. 14, 2025-01-06. DOI: 10.1093/gigascience/giaf108. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12491943/pdf/
Pub Date: 2025-01-06. DOI: 10.1093/gigascience/giaf131
Abao Xing, Tiantian Cai, Haofan Du, Zhifan Li, Hoiman Ng, Junrong Li, Guanmin Jiang, Lijun Chen, Kefeng Li
Background: Mendelian randomization (MR) is a powerful epidemiological method for inferring causal relationships between exposures and outcomes using genome-wide association study (GWAS) data. However, its adoption is limited by inconsistent data formats, lack of standardized workflows, and the need for programming expertise. To address these challenges, we developed MRanalysis, a user-friendly, web-based platform for integrated MR analysis, and GWASkit, a standalone tool for GWAS data preprocessing.
Results: MRanalysis provides a comprehensive, no-code workflow for MR analysis, including data quality assessment, power estimation, single-nucleotide polymorphism to gene enrichment, and visualization. It supports univariable, multivariable, and mediation MR analyses through an intuitive interface. GWASkit facilitates rapid GWAS data preprocessing, such as rs ID conversion and format standardization, with significantly higher accuracy and efficiency than existing tools. Case studies demonstrate the utility and efficiency of both tools in real-world scenarios.
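For readers unfamiliar with what a univariable MR estimate computes, here is a minimal Python sketch of the standard fixed-effect inverse-variance weighted (IVW) estimator. The abstract does not say which estimators MRanalysis implements, so treat this as a generic textbook illustration; the function name and the summary statistics are invented for the example.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW causal estimate from per-SNP summary statistics.

    Uses first-order weights: each SNP's ratio estimate (by / bx) is
    weighted by bx^2 / se_y^2.
    """
    num = sum(bx * by / se**2
              for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))
    den = sum(bx**2 / se**2
              for bx, se in zip(beta_exposure, se_outcome))
    return num / den, math.sqrt(1.0 / den)


# Invented summary statistics: every SNP has a ratio by / bx of 0.5.
estimate, std_error = ivw_estimate(
    beta_exposure=[0.1, 0.2, 0.3],
    beta_outcome=[0.05, 0.10, 0.15],
    se_outcome=[0.01, 0.01, 0.01],
)
print(f"IVW estimate = {estimate:.3f}, SE = {std_error:.4f}")
```

Because all three per-SNP ratios equal 0.5, the pooled estimate is 0.5 as well; real analyses add heterogeneity and pleiotropy checks on top of this.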
Conclusions: MRanalysis and GWASkit lower barriers to MR analysis, making it more accessible, reliable, and efficient. By democratizing MR, these tools can accelerate discoveries in genetic epidemiology, inform public health strategies, and guide targeted interventions. MRanalysis is freely available at https://mranalysis.cn, and GWASkit can be accessed at https://github.com/Li-OmicsLab-MPU/GWASkit. Together, they represent a significant advance in understanding the complex relationships between genes, exposures, and health outcomes.
MRanalysis: a comprehensive online platform for integrated, multimethod Mendelian randomization and associated post-GWAS analyses. Abao Xing, Tiantian Cai, Haofan Du, Zhifan Li, Hoiman Ng, Junrong Li, Guanmin Jiang, Lijun Chen, Kefeng Li. GigaScience, 2025-01-06. DOI: 10.1093/gigascience/giaf131. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12616851/pdf/
Pub Date: 2025-01-06. DOI: 10.1093/gigascience/giaf130
Lukas Forer, Sebastian Schönherr
Background: The workflow management system Nextflow, together with the nf-core community, has established an essential ecosystem in bioinformatics. However, ensuring the correctness and reliability of large and complex Nextflow pipelines remains challenging due to the lack of a unified, automated unit-testing framework.
Results: To address this gap, we present nf-test, a modular testing framework for bioinformatics workflows. It enables users to test process blocks, workflow patterns, and entire pipelines in isolation while validating their outputs. Built with a syntax similar to Nextflow DSL2, nf-test offers unique features such as snapshot testing and smart testing, which optimize resource usage by testing only modified modules. We demonstrate across multiple pipelines that these features minimize development time, reduce test execution time by up to 80%, and enhance software quality by identifying bugs and issues early in the development process.
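The idea behind snapshot testing can be sketched in a few lines of Python. This is a conceptual illustration only, not nf-test's syntax or implementation (nf-test uses its own Nextflow-like DSL); the function name, the JSON snapshot file, and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def check_snapshot(name, output_bytes, snapshot_file):
    """Compare an output's digest with the stored snapshot for `name`.

    The first run records the digest ("created"); later runs report
    "match" or "mismatch" without overwriting the stored value.
    """
    digest = hashlib.sha256(output_bytes).hexdigest()
    snapshots = json.loads(snapshot_file.read_text()) if snapshot_file.exists() else {}
    if name not in snapshots:
        snapshots[name] = digest
        snapshot_file.write_text(json.dumps(snapshots, indent=2))
        return "created"
    return "match" if snapshots[name] == digest else "mismatch"


with tempfile.TemporaryDirectory() as tmp:
    snap = Path(tmp) / "snapshots.json"
    results = [
        check_snapshot("call_variants", b"chr1\t100\tA\tG\n", snap),  # first run
        check_snapshot("call_variants", b"chr1\t100\tA\tG\n", snap),  # unchanged
        check_snapshot("call_variants", b"chr1\t100\tA\tT\n", snap),  # regression
    ]
print(results)  # ['created', 'match', 'mismatch']
```

The appeal of the approach is that the expected output never has to be written by hand: the first accepted run becomes the reference, and any later change is flagged for review.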
Conclusions: Already adopted by numerous pipelines, nf-test significantly improves the robustness, maintainability, and reliability of bioinformatics pipelines.
Improving the reliability, quality, and maintainability of bioinformatics pipelines with nf-test. Lukas Forer, Sebastian Schönherr. GigaScience, 2025-01-06. DOI: 10.1093/gigascience/giaf130. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12616847/pdf/
Pub Date: 2025-01-06. DOI: 10.1093/gigascience/giaf086
Craig Myles, In Hwa Um, Craig Marshall, David Harris-Birtill, David J Harrison
Background: Cancer remains one of the leading causes of morbidity and mortality worldwide. Comprehensive datasets that combine histopathological images with genetic and survival data across various tumour sites are essential for advancing computational pathology and personalised medicine.
Results: We present SurGen, a dataset comprising 1,020 H&E-stained whole-slide images (WSIs) from 843 colorectal cancer cases. The dataset includes detailed annotations for key genetic mutations (KRAS, NRAS, BRAF) and mismatch repair status, as well as survival data for 426 cases. We illustrate SurGen's utility with a proof-of-concept model that predicts mismatch repair status directly from WSIs, achieving a test area under the receiver operating characteristic curve of 0.8273. These preliminary results underscore the dataset's potential to facilitate research in biomarker discovery, prognostic modelling, and advanced machine learning applications in colorectal cancer and beyond.
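To make the reported metric concrete, the Python sketch below computes the area under the receiver operating characteristic curve from its pairwise definition: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as one half. The labels and scores are invented and have no connection to the SurGen model.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Invented slide-level predictions: 1 = deficient, 0 = proficient.
toy_auc = auroc(labels=[1, 0, 1, 0, 1], scores=[0.9, 0.8, 0.7, 0.3, 0.6])
print(f"AUROC = {toy_auc:.4f}")
```

Here four of the six positive-negative pairs are ranked correctly, giving 4/6. The quadratic loop is fine for a toy example; rank-based formulas are preferable for large test sets.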
Conclusions: SurGen offers a valuable resource for the scientific community, enabling studies that require high-quality WSIs linked with comprehensive clinical and genetic information on colorectal cancer. Our initial findings affirm the dataset's capacity to advance diagnostic precision and foster the development of personalised treatment strategies in colorectal oncology. Data available online: https://doi.org/10.6019/S-BIAD1285.
SurGen: 1020 H&E-stained whole-slide images with survival and genetic markers. Craig Myles, In Hwa Um, Craig Marshall, David Harris-Birtill, David J Harrison. GigaScience, 2025-01-06. DOI: 10.1093/gigascience/giaf086. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12569769/pdf/
Pub Date: 2025-01-06. DOI: 10.1093/gigascience/giaf055
Murilo Caminotto Barbosa, João Fernando Marques da Silva, Leonardo Cardoso Alves, Robert D Finn, Alexandre Rossi Paschoal
Background: Despite the surge in microbiome data acquisition, few tools can effectively analyze these data and identify correlations between taxonomic compositions and continuous environmental factors. Furthermore, existing tools do not predict environmental factors in new samples, underscoring the need for new solutions that enhance our understanding of microbiome dynamics and fill this prediction gap. Here we introduce CODARFE, a novel tool for sparse compositional microbiome predictor selection and prediction of continuous environmental factors.
Results: We tested CODARFE against 4 state-of-the-art tools in 2 experiments. First, CODARFE outperformed the other tools at predictor selection in 21 of 24 databases in terms of correlation. Second, among all the tools, CODARFE recovered the highest number of bacteria previously linked to environmental factors in human data, that is, at least 7% more. We also tested CODARFE in a cross-study setting, using the same biome under different external effects: a model trained on 1 dataset predicted environmental factors on another dataset with a mean absolute percentage error of 11%. Finally, CODARFE is available in 5 formats, ranging from a Windows version with a graphical interface to installable source code for Linux servers and an embedded Jupyter notebook available at MGnify.
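The cross-study result is reported as a mean absolute percentage error (MAPE). As a quick reminder of how that figure is computed, here is a minimal Python sketch; the measured and predicted values are invented and are not CODARFE output.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Undefined when any actual value is zero, so that case is rejected.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


# Invented example: measured vs predicted values of an environmental factor.
error = mape(actual=[10.0, 20.0, 40.0], predicted=[11.0, 18.0, 44.0])
print(f"MAPE = {error:.1f}%")
```

Each prediction in the example is off by 10% of the true value, so the MAPE is 10%; an 11% MAPE means predictions deviate from measurements by about a tenth on average.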
Conclusions: Our findings underscore the robustness and broad applicability of CODARFE across diverse fields, even under varying experimental conditions. Additionally, the ability to predict outcomes in new samples allows for the generation of new insights in previously unexplored contexts, providing researchers with a versatile tool.
CODARFE: Unlocking the prediction of continuous environmental variables based on microbiome. Murilo Caminotto Barbosa, João Fernando Marques da Silva, Leonardo Cardoso Alves, Robert D Finn, Alexandre Rossi Paschoal. GigaScience, vol. 14, 2025-01-06. DOI: 10.1093/gigascience/giaf055. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12365963/pdf/
Motivation: Drug combination therapy plays a pivotal role in addressing the molecular heterogeneity of cancer, improving treatment efficacy, minimizing resistance, and reducing toxicity. Deep learning approaches have significantly advanced drug combination discovery by addressing the limitations of conventional laboratory experiments, which are time-consuming and costly. While most existing models rely on the molecular structure of drugs and gene expression data, incorporating protein-level expression provides a more accurate representation of cellular behavior and drug responses. In this study, we introduce SynProtX, an enhanced deep learning model that explicitly integrates large-scale proteomics with deep neural networks (DNNs) and the molecular structure of drugs with graph neural networks (GNNs).
Results: The SynProtX-GATFP model, which combines molecular graphs and fingerprints through a graph attention network architecture, demonstrated superior predictive performance on the FRIEDMAN study dataset. We further evaluated its cell line-specific performance, where it remained accurate across diverse tissue and study datasets. By incorporating protein expression data, which reflect the functional state of cancer cells, the model consistently improved on the predictive performance of gene expression-only models. The generalizability of SynProtX was validated using cold-start prediction, including leave-drug-combination-out, leave-drug-out, and leave-cell-line-out validation strategies, highlighting its robust performance and potential for clinical applicability. Additionally, SynProtX identified key cancer-associated proteins and molecular substructures, offering novel insights into the biological mechanisms underlying drug synergy. These findings highlight the potential of integrating large-scale proteomics and multiomics data to advance anticancer drug design and combination therapy strategies for personalized medicine. Availability and implementation: https://github.com/manbaritone/SynProtX.
SynProtX: a large-scale proteomics-based deep learning model for predicting synergistic anticancer drug combinations. Bundit Boonyarit, Matin Kositchutima, Tisorn Na Phattalung, Nattawin Yamprasert, Chanitra Thuwajit, Thanyada Rungrotmongkol, Sarana Nutanong. GigaScience, vol. 14, 2025-01-06. DOI: 10.1093/gigascience/giaf080. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12343095/pdf/
Pub Date: 2025-01-06. DOI: 10.1093/gigascience/giaf081
Shijun Pan, Huan Du, Ruiqi Zheng, Cuijing Zhang, Jie Pan, Xilan Yang, Cheng Wang, Xiaolan Lin, Jinhui Li, Wan Liu, Haokui Zhou, Xiaoli Yu, Shuming Mo, Guoqing Zhang, Guoping Zhao, Zhili He, Yun Tian, Chengjian Jiang, Wu Qu, Yang Liu, Meng Li
Background: Mangroves are one of the most productive marine ecosystems with high ecosystem service value. The sediment microbial communities contribute to pivotal ecological functions in mangrove ecosystems. However, the study of mangrove sediment microbiomes is limited.
Findings: Here, we applied metagenome sequencing analysis of microbial communities in mangrove sediments across Southeast China from 2014 to 2020. This genome dataset includes 966 metagenome-assembled genomes with ≥50% completeness and ≤10% contamination generated from 6 groups of samples. Phylogenomic analysis and taxonomy classification show that mangrove sediments are inhabited by microbial communities with high species diversity. Thermoplasmatota, Thermoproteota, and Asgardarchaeota in archaea, as well as Proteobacteria, Desulfobacterota, Chloroflexota, Acidobacteriota, and Gemmatimonadota in bacteria, dominate the mangrove sediments across Southeast China. Functional analyses suggest that the microbial communities may contribute to carbon, nitrogen, and sulfur cycling in mangrove sediments.
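The quality thresholds quoted above (completeness of at least 50% and contamination of at most 10%) amount to a simple filter over per-genome quality estimates. The Python sketch below shows that filter on invented records; the class name, field names, and bin identifiers are hypothetical, and the abstract does not say which tool produced the estimates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MagQuality:
    """Quality estimates for one metagenome-assembled genome (percent)."""
    name: str
    completeness: float
    contamination: float


def passes_thresholds(mag, min_completeness=50.0, max_contamination=10.0):
    """True if the genome meets both thresholds (bounds are inclusive)."""
    return (mag.completeness >= min_completeness
            and mag.contamination <= max_contamination)


# Invented candidates; only the first two satisfy both criteria.
candidates = [
    MagQuality("bin_001", 92.4, 1.8),
    MagQuality("bin_002", 50.0, 10.0),   # exactly on both boundaries
    MagQuality("bin_003", 48.9, 0.5),    # too incomplete
    MagQuality("bin_004", 77.0, 12.3),   # too contaminated
]
kept = [m.name for m in candidates if passes_thresholds(m)]
print(kept)  # ['bin_001', 'bin_002']
```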
Conclusions: These combined microbial genomes provide an important complement to global mangrove genome datasets and may serve as a foundational resource for enhancing our understanding of the composition and functions of mangrove sediment microbiomes.
A holistic genome dataset of bacteria and archaea of mangrove sediments. Shijun Pan, Huan Du, Ruiqi Zheng, Cuijing Zhang, Jie Pan, Xilan Yang, Cheng Wang, Xiaolan Lin, Jinhui Li, Wan Liu, Haokui Zhou, Xiaoli Yu, Shuming Mo, Guoqing Zhang, Guoping Zhao, Zhili He, Yun Tian, Chengjian Jiang, Wu Qu, Yang Liu, Meng Li. GigaScience, vol. 14, 2025-01-06. DOI: 10.1093/gigascience/giaf081. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12343073/pdf/
Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf138
Uri Hartmann, Eran Shaham, Dafna Nathan, Ilana Blech, Danny Zeevi
Background: Phasing, the assignment of alleles to their respective parental chromosomes, is fundamental to studying genetic variation and identifying disease-causing variants. Traditional approaches, including statistical, pedigree-based, and read-based phasing, face challenges such as limited accuracy for rare variants and reliance on external reference panels.
Results: To address these limitations, we developed TinkerHap, a novel phasing algorithm that integrates a read-based phaser, built on pairwise distance-based unsupervised classification, with external phased data, such as statistical or pedigree phasing. We evaluated TinkerHap's performance against other phasing algorithms using 1,040 parent-offspring trios from the UK Biobank (Illumina short reads) and the GIAB Ashkenazi trio (PacBio long reads). TinkerHap's read-based phaser alone achieved higher phasing accuracy than all other algorithms, with 95.1% for short reads (second best: 94.8%) and 97.5% for long reads (second best: 95.5%). Its hybrid approach further improved short-read accuracy to 96.3% and was able to phase 99.5% of all heterozygous sites. TinkerHap also extended haplotype block sizes to a median of 79,449 bp for long reads (second best: 68,303 bp) and demonstrated higher accuracy for both single-nucleotide polymorphisms and indels.
Conclusions: The combination of a robust read-based algorithm and a hybrid integration strategy makes TinkerHap a powerful and versatile tool for genomic analysis, enabling more accurate, contiguous, and comprehensive phasing across diverse sequencing platforms and variant types.
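To make the idea of read-based phasing by pairwise distance more concrete, the sketch below groups reads into two haplotypes from their alleles at heterozygous sites. It is a generic toy illustration and not TinkerHap's algorithm: the read representation (a dict mapping site position to allele 0 or 1), the greedy assignment rule, and the function names `read_distance` and `phase_reads` are all assumptions made for this example.

```python
def read_distance(a, b):
    """Fraction of shared heterozygous sites where two reads disagree.

    Returns None when the reads share no site and cannot be compared.
    """
    shared = a.keys() & b.keys()
    if not shared:
        return None
    return sum(a[pos] != b[pos] for pos in shared) / len(shared)


def phase_reads(reads):
    """Greedily label each read 0 or 1 (haplotype), or None if unlinked.

    The first read seeds haplotype 0. Every later read is compared with the
    reads labelled so far: a small distance to a haplotype-0 read, or a large
    distance to a haplotype-1 read, both count as evidence for haplotype 0.
    """
    labels = [None] * len(reads)
    if reads:
        labels[0] = 0
    for i in range(1, len(reads)):
        votes = []
        for j in range(i):
            if labels[j] is None:
                continue
            d = read_distance(reads[i], reads[j])
            if d is None:
                continue
            # Express every comparison as "distance from haplotype 0".
            votes.append(d if labels[j] == 0 else 1.0 - d)
        if votes:
            labels[i] = 0 if sum(votes) / len(votes) < 0.5 else 1
    return labels


# Toy reads: {site position: observed allele}. Illustrative values only.
example_reads = [
    {100: 0, 200: 1},
    {200: 0, 300: 1},
    {100: 0, 200: 1, 300: 0},
    {300: 1, 400: 0},
    {900: 1},  # shares no site with the others, so it stays unphased
]

print(phase_reads(example_reads))
```

The greedy pass depends on read order and leaves unlinked reads unlabelled; it also has no notion of sequencing error or of the external phased data that the hybrid approach in the abstract draws on.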
TinkerHap - a novel read-based phasing algorithm with integrated multimethod support for enhanced accuracy. Uri Hartmann, Eran Shaham, Dafna Nathan, Ilana Blech, Danny Zeevi. GigaScience. Published 2025-01-06. DOI: 10.1093/gigascience/giaf138. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12723663/pdf/
Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf135
Caner Bağcı, Timo Negri, Elena Buena-Atienza, Caspar Gross, Stephan Ossowski, Nadine Ziemert
Background: Soil ecosystems have long been recognized as hotspots of microbial diversity, but most estimates of their microbial and functional complexity remain speculative despite decades of study, in part because conventional sequencing campaigns lack the depth and contiguity required to recover low-abundance and repetitive genomes. Here, we revisit this question using one of the deepest metagenomic sequencing efforts to date, applying 148 billion basepairs of Nanopore long-read data and 122 billion basepairs of Illumina short-read data to a single forest soil sample.
Results: Our hybrid assembly reconstructed 837 metagenome-assembled genomes, including 466 that meet high- and medium-quality standards, nearly all lacking close relatives among cultivated taxa. Rarefaction and k-mer analyses reveal that, even at this depth, we capture only a fraction of the extant diversity: nonparametric models project that more than 10 trillion basepairs of sequencing data would be required to approach saturation. These findings offer a quantitative, technology-enabled update to long-standing diversity estimates and demonstrate that conventional metagenomic sequencing efforts likely miss most microbial and biosynthetic potential in soil. We further identify more than 11,000 biosynthetic gene clusters, over 99% of which have no match in current databases, underscoring the breadth of unexplored metabolic capacity.
Conclusions: Taken together, our results emphasize both the power and the present limitations of metagenomics in resolving natural microbial complexity, and they provide a new baseline for evaluating future advances in microbial genome recovery, taxonomic classification, and natural product discovery.
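The Results refer to rarefaction and nonparametric projections of unsampled diversity. As a rough illustration of both ideas, the sketch below builds a single-permutation accumulation curve and computes the bias-corrected Chao1 richness estimate, a standard nonparametric estimator. It is not the authors' pipeline: the abstract does not name the models used, and the function names and toy observations here are assumptions.

```python
import random
from collections import Counter


def rarefaction_curve(observations, depths, seed=0):
    """Distinct categories seen in the first d items of one random shuffle.

    A single permutation is used for brevity; rarefaction curves are usually
    averaged over many permutations.
    """
    rng = random.Random(seed)
    shuffled = list(observations)
    rng.shuffle(shuffled)
    return [(d, len(set(shuffled[:d]))) for d in depths]


def chao1(observations):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)).

    F1 and F2 are the numbers of categories observed exactly once and twice.
    """
    counts = Counter(observations)
    s_obs = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)
    f2 = sum(1 for c in counts.values() if c == 2)
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Toy observations (for example, taxon labels of sampled reads).
toy = ["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d", "e", "f"]

curve = rarefaction_curve(toy, depths=[1, 5, len(toy)])
print(curve)
print(chao1(toy))  # 6 observed + 3 * 2 / (2 * 2) = 7.5
```

When the curve is still rising at full depth and many categories are seen only once, the estimate exceeds the observed richness, which is the qualitative pattern the abstract describes for soil.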
Ultra-deep long-read metagenomics captures diverse taxonomic and biosynthetic potential of soil microbes. Caner Bağcı, Timo Negri, Elena Buena-Atienza, Caspar Gross, Stephan Ossowski, Nadine Ziemert. GigaScience. Published 2025-01-06. DOI: 10.1093/gigascience/giaf135. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12690461/pdf/