Pub Date: 2026-01-21 DOI: 10.1093/gigascience/giag007
Janet Joy, Andrew I Su
Background: Large language models (LLMs) have significantly advanced natural language processing in biomedical research; however, their reliance on implicit, statistical representations often results in factual inaccuracies or hallucinations, posing significant concerns in high-stakes biomedical contexts.
Results: To overcome these limitations, we developed BioThings Explorer-Retrieval-Augmented Generation (BTE-RAG), a Retrieval-Augmented Generation framework that integrates the reasoning capabilities of advanced language models with explicit mechanistic evidence sourced from BTE, an API federation of more than sixty authoritative biomedical knowledge sources. We systematically evaluated BTE-RAG in comparison to traditional LLM-only methods across three benchmark datasets that we created from DrugMechDB. These datasets specifically targeted gene-centric mechanisms (798 questions), metabolite effects (201 questions), and drug-biological process relationships (842 questions). On the gene-centric task, BTE-RAG increased accuracy from 51 to 75.8% for GPT-4o mini and from 69.8 to 78.6% for GPT-4o. In metabolite-focused questions, the proportion of responses with cosine similarity scores of at least 0.90 rose by 82% for GPT-4o mini and 77% for GPT-4o. While overall accuracy was consistent in the drug-biological process benchmark, the retrieval method enhanced response concordance, producing a greater than 10% increase in high-agreement answers (from 129 to 144) using GPT-4o. We additionally evaluated BTE-RAG alongside GeneGPT-based models on the GeneTuring gene-disease association benchmark and on our mechanistic gene benchmark, demonstrating that the BTE-RAG layer consistently improves accuracy relative to alternative approaches.
Conclusion: Federated knowledge retrieval provides transparent improvements in accuracy for LLMs, establishing BTE-RAG as a valuable and practical tool for mechanistic exploration and translational biomedical research.
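The metrics quoted in the Results reduce to simple arithmetic. The sketch below is illustrative only and is not the authors' evaluation code: the function names and the toy inputs are assumptions, and the only figures taken from the abstract are the 0.90 similarity threshold and the 129 and 144 counts of high-agreement answers.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


def share_at_or_above(scores, threshold=0.90):
    """Fraction of similarity scores that reach the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


def relative_increase(before, after):
    """Relative change, e.g. 0.10 means a 10% increase."""
    return (after - before) / before


# Counts of high-agreement answers reported for GPT-4o (129 without, 144 with retrieval).
print(f"{relative_increase(129, 144):.1%}")  # prints 11.6%
```

The 11.6% figure is consistent with the "greater than 10% increase" stated in the abstract. How the embeddings behind the similarity scores were produced is not described here, so `cosine_similarity` is shown only on generic vectors.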
Title: Federated knowledge retrieval elevates large language model performance on biomedical benchmarks.
Journal: GigaScience. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12888809/pdf/
Pub Date: 2026-01-21 DOI: 10.1093/gigascience/giaf145
Joanna Szablińska-Piernik, Paweł Sulima, Jakub Sawicki
Background: The liverwort Apopellia endiviifolia, a dioicous, simple thalloid species, is notable for its cryptic diversity, habitat adaptability, and genomic innovation, and it represents a clade that is sister to all other Jungermanniopsida. These features make A. endiviifolia an essential model for exploring speciation mechanisms and the evolution of genome structures within liverworts.
Findings: We present the genome assembly of a haploid A. endiviifolia isolate with a total size of 2,914,960,273 bp and an N50 of 468,157,909 bp, demonstrating high completeness (99.2% BUSCO) and a high consensus quality (quality value 47.6). The assembly consisted of 9 chromosomes, which included 18 telomeres and 9 centromeres (ranging from 1.9 to 5 Mbp in length). RNA sequencing-based annotation identified 34,615 genes, predominantly protein coding. The transposable elements comprised 12.16% long terminal repeat elements and 57 Helitrons. Among the retroelements, the Copia and Gypsy superfamilies comprised 8.94% and 2.95% of the genome, respectively. The Ty3/Gypsy superfamily was significantly enriched in centromeric regions. The average GC content ranged from 38.8% to 39.6%, with gene density varying between 5.52 and 9.78 genes per 500 kbp. Synteny analysis of related liverwort species has revealed complex chromosomal relationships, indicating extensive genome rearrangements among species.
Conclusions: This study provides the first high-quality reference genome assembly of the haploid liverwort A. endiviifolia. Assembly and annotation offer valuable resources for investigating liverwort evolution, centromere biology, and genome expansion in simple thalloid liverworts.
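For readers unfamiliar with the N50 statistic quoted in the Findings, the following is a minimal sketch of the usual definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The function name and the toy lengths are assumptions for illustration; the real per-chromosome lengths of this assembly are not given in the abstract.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half of the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy example with made-up lengths (not the A. endiviifolia chromosomes).
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # prints 8
```

With only 9 chromosomes in a 2.9 Gbp assembly, an N50 in the hundreds of megabases, as reported, is what this definition would be expected to yield.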
Title: Giant chromosomes of a tiny plant: the complete telomere-to-telomere genome assembly of the simple thalloid liverwort Apopellia endiviifolia (Jungermanniopsida, Marchantiophyta).
Journal: GigaScience. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12885004/pdf/
Pub Date: 2026-01-21 DOI: 10.1093/gigascience/giag009
Evangelos Karatzas, Martin Beracochea, Fotis A Baltoumas, Eleni Aplakidou, Lorna Richardson, James A Fellows Yates, Daniel Lundin
The growth of metagenomics-derived amino acid sequence data has transformed our understanding of protein function, microbial diversity, and evolutionary relationships. However, the vast majority of these proteins remain functionally uncharacterized. Grouping the millions of such uncharacterised sequences with the few experimentally characterised ones allows the transfer of annotations, while the inspection of conserved residues with multiple sequence alignments can provide clues to function, even in the absence of existing functional information. To address the challenges associated with this data surge and the need to group sequences, we present a scalable, open-source, parametrizable Nextflow pipeline (nf-core/proteinfamilies) that generates nascent protein families or assigns new proteins to existing families. The computational benchmarks demonstrated that resource usage scales approximately linearly with input size, and the biological benchmarks showed that the generated protein families closely resemble manually curated families in widely used databases.
Title: nf-core/proteinfamilies: A scalable pipeline for the generation of protein families.
Journal: GigaScience. URL: https://doi.org/10.1093/gigascience/giag009
Pub Date: 2026-01-21 DOI: 10.1093/gigascience/giaf156
Diane Duroux, Paul P Meyer, Giovanni Visonà, Niko Beerenwinkel
Background: The deployment of machine learning in clinical settings is often hindered by the limited generalizability of the models. Models that perform well during development tend to underperform in new environments, limiting their clinical utility. This issue affects models designed for the rapid identification of antimicrobial resistance, which is essential to guide treatment decisions. Traditional susceptibility tests can take up to 3 days, whereas integrating matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometry with machine learning has the potential to reduce this to 1 day. However, model performance declines drastically in hospitals or time frames outside the training data.
Results: To improve robustness, we develop advanced feature representations using masked autoencoders (MAEs) for MALDI-TOF spectra and chemical language models and SELF-referencing embedded strings (SELFIES) for antimicrobials. Cross-validated on data from 4 medical institutions, our models demonstrate improved performance and stability. The MAE and SELFIES encodings increase the area under the precision-recall curve by 4% when evaluated on unseen time periods, while the MAE and Molformer language model encodings improve it by 10% when applied across different hospitals.
Conclusions: These results underscore the value of combining deep learning with chemical and spectral information to build generalizable, high-impact clinical artificial intelligence.
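The Results are stated in terms of the area under the precision-recall curve. As a hedged illustration, the sketch below computes average precision, one common estimator of that area; the abstract does not say which estimator the authors used, and the function name, labels, and scores here are invented for the example.

```python
def average_precision(labels, scores):
    """Average precision for binary labels (1 = resistant, 0 = susceptible).

    Samples are ranked by descending score; precision is averaged over the
    ranks at which a positive sample occurs. Tied scores are not treated
    specially in this sketch.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos


# Made-up predictions for four isolates.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
print(round(ap, 4))
```

Unlike the area under the ROC curve, this quantity depends on class balance, so comparisons across hospitals or time periods are most meaningful when resistance prevalence is reported alongside it.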
Title: Generalizable machine learning models for rapid antimicrobial resistance prediction in unseen health care settings.
Journal: GigaScience. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12908719/pdf/
Pub Date: 2026-01-21 DOI: 10.1093/gigascience/giaf141
Ian Tsang, Lawrence Percival-Alwyn, Stephen Rawsthorne, James Cockram, Fiona Leigh, Jonathan A Atkinson
Background: Root hairs play a key role in plant nutrient and water uptake. Historically, root hair traits have largely been quantified manually. As such, this process has been laborious and low-throughput. However, given their importance for plant health and development, high-throughput quantification of root hair morphology could help underpin rapid advances in the genetic understanding of these traits. With recent increases in the accessibility and availability of artificial intelligence (AI) and machine learning techniques, the development of tools to automate plant phenotyping processes has been greatly accelerated.
Results: We present pyRootHair, a high-throughput, AI-powered software application to automate root hair trait extraction from microscope images of plant roots grown on agar plates. pyRootHair is capable of batch processing over 600 images per hour without manual input from the end user. In this study, we deploy pyRootHair on a panel of 24 diverse wheat (Triticum aestivum and Triticum turgidum ssp. durum) cultivars and uncover a large, previously unresolved amount of variation in many root hair traits. We show that the overall root hair profile falls under 2 distinct shape categories and that different root hair traits often correlate with each other. We also demonstrate that pyRootHair can be deployed on a range of plant species, including oat (Avena sativa), rice (Oryza sativa), teff (Eragrostis tef), and tomato (Solanum lycopersicum).
Conclusions: The application of pyRootHair enables users to rapidly screen a large number of plant germplasm resources for variation in root hair morphology, supporting high-resolution measurements and high-throughput data analysis. This facilitates downstream investigation of the impacts of root hair genetic control and morphological variation on plant performance. pyRootHair is installable via PyPI (https://pypi.org/project/pyRootHair/) and can be accessed on GitHub at https://github.com/iantsang779/pyRootHair.
Title: pyRootHair: Machine learning accelerated software for high-throughput phenotyping of plant root hair traits.
Journal: GigaScience. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12824728/pdf/
Background: The genomes of mangrove Acanthus species have not been reported, despite their ecological and medicinal importance. Here, we generated reference genomes for three shrub mangroves in the genus Acanthus to clarify their whole-genome duplication and hybridization events and identify genomic features underlying their evolution.
Results: Using PacBio and Hi-C data, we generated a chromosome-scale genome assembly of the recently identified allotetraploid species Acanthus tetraploideus (2n = 96). The genomes of diploid progenitors, Acanthus ilicifolius and Acanthus ebracteatus (2n = 48), were assembled from single-tube long fragment read data. We identified an Acanthus-specific whole-genome duplication (WGD) event that occurred ∼43 million years ago (Mya). Ancestral karyotype reconstruction revealed a shift in haploid chromosome number from 11 to 24 in the progenitors, following the WGD and subsequent chromosomal fission events. The hybridization that formed A. tetraploideus was estimated to have occurred 0.7-1.8 Mya. Phylogenomic and synteny analyses clearly showed that A. tetraploideus inherited subgenomes SG1 and SG2 from A. ilicifolius and A. ebracteatus, respectively. Gene structure and retention analyses revealed a smaller and more structurally flexible genome in A. ebracteatus and SG2 compared with A. ilicifolius and SG1. Gene family and machine learning analyses identified expansions in protein families related to Casparian strip formation, root development, and salt stress response. Several of these families were expanded in A. ilicifolius and SG1 but contracted in A. ebracteatus and SG2. These genomic patterns might have contributed to the establishment of A. tetraploideus within the habitat of A. ebracteatus. For all three species, population analysis revealed clear genetic divergence between samples from the eastern and western coasts of Thailand.
Conclusions: These genome assemblies clarify the polyploidy and hybridization history of Acanthus and highlight gene family changes potentially associated with coastal root adaptation and habitat establishment in intertidal environments. This study provides valuable genomic resources and insights into the evolutionary adaptation of plants to intertidal environments.
Title: Genome Assembly of Three Shrub Mangroves in the Genus Acanthus Reveals Two Polyploidy Events and Expansion of Genes Linked to Root Adaptation in Coastal Habitats.
Wanapinun Nawae, Chaiwat Naktang, Peeraphat Paenpong, Duangjai Sangsrakru, Thippawan Yoocha, Sonicha U-Thoomporn, Wasitthee Kongkachana, Poonsri Wanthongchai, Suchart Yamprasai, Chonlawit Samart, Sithichoke Tangphatsornruang, Wirulda Pootakham
Pub Date: 2026-01-21 DOI: 10.1093/gigascience/giaf162
Journal: GigaScience. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12903786/pdf/
Accurate genetic variant interpretation is crucial for disease research and the development of targeted therapies. Artificial intelligence is transforming this field by integrating computational methodologies across structural biology, evolutionary analysis, and multimodal genomic data. This review examines the evolution from traditional rule-based systems and statistical models to contemporary machine learning, deep learning, and protein language models, while addressing critical challenges in variant classification. Key obstacles include data heterogeneity, interpretability, and the persistence of variants of uncertain significance, emphasizing the critical need for explainable artificial intelligence frameworks and more inclusive genomic databases to improve predictive accuracy across diverse populations. Based on the assessment of current variant impact predictors, we propose strategies for enhanced predictor selection, effective multi-omics data integration, and optimized computational workflows. These recommendations aim to enhance variant interpretation accuracy in both research settings and clinical practice, ultimately contributing to advances in personalized medicine.
Title: Harnessing artificial intelligence for genomic variant prediction: advances, challenges, and future directions.
Indah Pakpahan, Mentari Sihombing, Haohan Liu, Mengyao Wang, Zheng Su, Mingyan Fang
Pub Date: 2026-01-21 DOI: 10.1093/gigascience/giag004
Journal: GigaScience. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12888390/pdf/
Pub Date: 2026-01-21 DOI: 10.1093/gigascience/giaf161
Joseph E Powell
Large-scale single-cell genomics projects have revolutionized our understanding of human immune variation. Yet most studies to date have been Eurocentric, limited in cell-type resolution, or restricted to a single data modality. The newly published Chinese Immune Multi-Omics Atlas helps address these gaps by profiling 428 healthy Chinese adults using a multiomics single-cell approach that combines single-cell RNA sequencing and single-cell chromatin accessibility sequencing across over 10 million immune cells. This integrated strategy enabled the identification of 73 distinct immune cell subsets and the construction of cell-type-specific gene regulatory networks linking noncoding enhancers to target genes. The atlas delineated hundreds of enhancer modules (eRegulons), highlighting both established and novel regulators of immune cell identity. By aligning transcriptomic and epigenomic maps, Yin et al. show how expanding both the ancestral diversity and data modalities of immune cell genomics can reveal new biology and provide a valuable addition to global reference cell atlases.
Title: Charting immune variation through genetics and single-cell genomics.
Journal: GigaScience
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12821369/pdf/
Pub Date : 2026-01-21 DOI: 10.1093/gigascience/giaf133
Cenny Taslim, Yuan Zhang, Galen Rask, Genevieve C Kendall, Emily R Theisen
Background: RNA sequencing (RNA-seq) analysis has become a routine task in numerous genomic research labs, driven by the reduced cost of bulk RNA sequencing experiments. These studies generate billions of reads that require easy-to-run, comprehensive, and reproducible analysis. However, many labs rely on in-house scripts, which can be challenging for bench scientists to use and hinder standardization and reproducibility. While existing RNA-seq pipelines attempt to address these challenges, they often lack a complete end-to-end user interface.
Findings: To bridge this gap, we developed RNA-SeqEZPZ, an automated pipeline with a user-friendly point-and-click interface, enabling rigorous and reproducible RNA-seq analysis without requiring programming or bioinformatics expertise. For advanced users, the pipeline can also be executed from the command line, allowing customization of steps to suit specific applications. The innovation of this pipeline lies in the combination of 3 key features: (i) all software is packaged within a Singularity container, eliminating installation issues; (ii) it offers a graphical, point-and-click interface from raw FASTQ files through differential expression and pathway analysis; and (iii) it includes a Nextflow implementation, enabling scalability and portability for seamless execution across various platforms, including job submission in the cloud and cluster computing. Additionally, RNA-SeqEZPZ generates a comprehensive statistical report and offers an option for batch adjustment to minimize effects of noise due to technical variation across replicates. Reports can also be reviewed by a bioinformatician to ensure the overall quality of the analysis.
Conclusions: RNA-SeqEZPZ is a robust, accessible, and scalable solution for comprehensive RNA-seq analysis, enabling researchers to focus on biological insights rather than computational challenges.
Title: RNA-SeqEZPZ: a point-and-click pipeline for comprehensive transcriptomics analysis with interactive visualizations.
Journal: GigaScience
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12857227/pdf/
Pub Date : 2026-01-21 DOI: 10.1093/gigascience/giaf158
Hangwei Liu, Lihong Lei, Fan Jiang, Bo Zhang, Hengchao Wang, Yutong Zhang, Hanbo Zhao, Guirong Wang, Wei Fan
Background: Praying mantises, members of the order Mantodea, play important roles in agriculture, medicine, bionics, and entertainment. However, the scarcity of genomic resources has hindered extensive studies on mantis evolution and behavior.
Results: Here, we present the chromosome-scale reference genomes of 5 mantis species: the European mantis (Mantis religiosa), Chinese mantis (Tenodera sinensis), triangle dead leaf mantis (Deroplatys truncata), orchid mantis (Hymenopus coronatus), and metallic mantis (Metallyticus violacea). The assembled genome sizes range from ∼2.3 to 4.2 Gb, with contig N50 sizes of 1-109 Mb and 85%-99% of sequence anchored to chromosomes. The number of annotated protein-coding genes ranges from 17,804 to 19,017, with BUSCO completeness of 96.7%-98.4%. We found that transposable element expansion is the major force governing genome size in Mantodea and suggest that translocations between the X chromosome and an autosome have occurred in the lineage of the family Mantidae. In addition, we found that the lineage of M. violacea has accumulated fewer substitutions than the lineages of other mantises. Furthermore, our genome-wide analyses showed that D. truncata is sister to H. coronatus rather than to M. religiosa and T. sinensis, which helps resolve the phylogenetic controversies surrounding the genus Deroplatys.
Conclusions: The high-quality genome assemblies of the 5 mantises provide a valuable resource for evolutionary studies of Mantodea and for the genetic improvement and breeding of beneficial biological control agents.
Title: The genomes of 5 mantises provide insights into sex chromosome evolution and Mantodea phylogeny clarification.
Journal: GigaScience
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12908712/pdf/