Pub Date: 2026-02-09. DOI: 10.1093/gigascience/giag012
Zexuan Wang, Qipeng Zhan, Shu Yang, Zhuoping Zhou, Mengyuan Kan, Tianhuan Zhai, Li Shen
Background: Recent advancements in single-cell omics technologies have enabled detailed characterization of cellular processes. However, co-assay sequencing technologies remain limited, resulting in unpaired single-cell omics datasets with differing feature dimensions.
Finding: We present GROTIA (Graph-Regularized Optimal Transport Framework for Diagonal Single-Cell Integrative Analysis), a computational method to align multi-omics datasets without requiring any prior correspondence information. GROTIA achieves global alignment through optimal transport while preserving local relationships via graph regularization. Additionally, our approach provides interpretability by deriving domain-specific feature importance from partial derivatives, highlighting key biological markers. Moreover, the transport plan between modalities can be leveraged for post-integration clustering, enabling a data-driven approach to discover novel cell subpopulations.
Conclusions: We demonstrate GROTIA's superior performance on four simulated and four real-world datasets, surpassing state-of-the-art unsupervised alignment methods and confirming the biological significance of the top features identified in each domain.
Title: An Interpretable Graph-Regularized Optimal Transport Framework for Diagonal Single-Cell Integrative Analysis. Journal: GigaScience. Publication type: Journal Article.
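To make the notion of a "transport plan" concrete, the following is a minimal sketch of entropy-regularized optimal transport (Sinkhorn iterations) in plain Python. It is not GROTIA's algorithm: it assumes a precomputed cost matrix between two small point sets in a shared space, whereas GROTIA works without prior correspondence and adds graph regularization that is not reproduced here. The function name `sinkhorn_plan`, the regularization value, and the toy coordinates are all illustrative assumptions.

```python
import math


def sinkhorn_plan(cost, a, b, reg=0.5, n_iter=1000):
    """Return an entropy-regularized transport plan for marginals a and b.

    cost is an n x m list of lists; a and b are probability vectors.
    This is a generic Sinkhorn sketch, not the GROTIA objective.
    """
    n, m = len(a), len(b)
    # Gibbs kernel derived from the cost matrix.
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale columns and rows to match the marginals.
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy data (synthetic): three "cells" per modality on a shared 1-D axis.
x = [0.0, 1.0, 2.0]
y = [0.1, 1.1, 2.1]
cost = [[(xi - yj) ** 2 for yj in y] for xi in x]
a = [1 / 3] * 3
b = [1 / 3] * 3

plan = sinkhorn_plan(cost, a, b)
# For each source cell, the target cell receiving the most mass.
matches = [max(range(len(y)), key=lambda j: row[j]) for row in plan]
```

Each row of `plan` describes how one source cell's mass is spread over target cells; reading off the largest entry per row, as in `matches`, is one simple way such a plan could feed a downstream matching or clustering step.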
Pub Date: 2026-02-06. DOI: 10.1093/gigascience/giag014
Yawako W Kawaguchi, Rui Matsumoto, Shigehiro Kuraku
High-quality chromosome-level assemblies are essential for understanding genome evolution but remain difficult to obtain for large and complex genomes. Here we present a near gap-free genome assembly of the whale shark (Rhincodon typus) generated with long-read sequencing and Hi-C scaffolding, markedly improving contiguity and completeness. In particular, the X chromosome was extended to nearly twice its previous length, and putative pseudoautosomal regions were identified. Moreover, we report the first Y-linked scaffolds for this species. Comparative analyses with the zebra shark revealed exceptionally low substitution rates across the genome. We further detected a negative correlation between chromosome length and synonymous substitution rate (dS), explained by a positional gradient, here referred to as "chromocline", in which substitution rates gradually decrease from chromosomal ends toward central regions. Notably, the X chromosome exhibited low dS compared with autosomes of similar size, consistent with male-driven evolution. Our results highlight positional and sex-chromosome effects as key determinants of molecular evolutionary rates. The improved assembly will enable broad application to population-genetic and conservation genomic analyses in the whale shark.
Title: Improved genome assembly of whale shark, the world's biggest fish: revealing intragenomic heterogeneity in molecular evolution. Journal: GigaScience. Publication type: Journal Article.
The chromatin accessibility landscape is the basis of cell-specific gene expression. We generated a multi-organ, single-nucleus chromatin accessibility landscape from the model organism Rattus norvegicus. For this single-cell atlas, we constructed 25 libraries via snATAC-seq from nine organs in the rat, with a total of over 110,000 cells. Cell classification integrating gene activity scores with known marker genes identified 77 cell types, which were strongly correlated with those in published mouse single-cell transcriptome atlases. We further investigated the enrichment of cell type- and organ-specific transcription factors (TFs), shared and organ-specific features of endothelial and stromal cells, cross-organ macrophage regulatory states, and the conservation and specificity of gene regulatory programs across species. Together, these findings provide a valuable foundation for dissecting tissue-specific regulatory logic and for advancing cross-organ and cross-species cell type annotation and functional inference in the rat model.
Title: Single-nucleus multiple-organ chromatin accessibility landscape in the adult rat. Authors: Ronghai Li, Shanshan Duan, Qiuting Deng, Wen Ma, Chang Liu, Peng Gao, Li Lu, Yue Yuan. Journal: GigaScience. Pub Date: 2026-02-03. DOI: 10.1093/gigascience/giag013. Publication type: Journal Article.
Pub Date: 2026-01-31. DOI: 10.1093/gigascience/giaf153
Shaadi Mehr, Todd Castoe, Marymegan Daly, Florence Jungo, Kim N Kirchhoff, Ivan Koludarov, Stephen P Mackessy, Jason Macrander, Praveena Naidu, Maria Vittoria Modica, Elda E Sanchez, Giulia Zancolli, Mandë Holford
Venomous animal research is hampered by fragmented, specialized, and non-interoperable databases (isolated genomic, proteomic, and ecological data). Despite the immense promise of venomous organisms to yield novel bioactive compounds for pharmacological and evolutionary applications, the informatics landscape for such taxa has remained patchy, lacking macro-scale integration across species. We present VenomsBase, an integrated, modular resource that synthesizes multi-omics data, ecological metadata, and functional annotations for venom-bearing organisms. Following the FAIR guidelines, VenomsBase combines an ontology-driven architecture with big-data cloud workflows for sequence integration, motif clustering, 3D display, and linking ecological metadata. Standardized tools and training modules facilitate worldwide access to resources for researchers in both developed countries and resource-limited areas. Its plug-and-play design allows for integration of additional analytical modules and extension to other species. Users can also examine evolutionary trends and connect venom chemistry to ecological niches. VenomsBase would (i) accelerate the pace of venom discovery, whether for therapeutic purposes or evolutionary significance, by providing validated, cross-referenced data sets and community-driven curation, and (ii) foster an open, just, and innovation-ready venom research ecosystem.
Title: A Proposed Unified, Scalable Platform for Integrative Research on Venomous Species. Journal: GigaScience. Publication type: Journal Article.
Pub Date: 2026-01-29. DOI: 10.1093/gigascience/giag011
Abdulkadir Elmas, Hillary M Layden, Jacob D Ellis, Luke N Bartlett, Xian Zhao, Reika Kawabata-Iwakawa, Zishan Wang, Hideru Obinata, Scott W Hiebert, Kuan-Lin Huang
Background: Cancer cells are heterogeneous, each harboring distinct molecular aberrations and being dependent on different genes for their survival and proliferation. While targeted therapies based on driver DNA mutations have shown success, many tumors lack druggable mutations, limiting treatment options. We hypothesize that new precision oncology targets may be identified through "expression-driven dependency," where cancer cells with high expression of specific genes are more vulnerable to the knockout of those same genes.
Results: We developed BEACON, a Bayesian approach to identify expression-driven dependency targets by analyzing global transcriptomic and proteomic profiles alongside genetic dependency data from cancer cell lines across 17 tissue lineages. BEACON successfully identified known druggable genes, including BCL2, ERBB2, EGFR, ESR1, and MYC, while revealing novel targets confirmed by both mRNA and protein expression-driven dependency. The identified genes showed a 3.8-fold enrichment for approved drug targets and a 7- to 10-fold enrichment for druggable oncology targets. Experimental validation demonstrated that depletion of GRHL2, TP63, and PAX5 effectively reduced tumor cell growth and survival in their dependent cells.
Conclusions: Our approach provides a systematic method to identify precision oncology targets based on expression-driven dependency patterns. By integrating multi-omics data with genetic dependency screens, we have created a comprehensive catalog of potential therapeutic targets that may expand treatment options for cancer patients lacking druggable mutations. This resource offers new opportunities for precision oncology target discovery beyond mutation-based approaches.
Title: Expression-Driven Genetic Dependency Reveals Targets for Precision Oncology. Journal: GigaScience. Publication type: Journal Article.
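The idea of "expression-driven dependency" can be illustrated with a much simpler stand-in than BEACON's Bayesian model: a plain Pearson correlation, across cell lines, between one gene's expression and its knockout dependency score. The sketch below uses synthetic values and assumes the common screening convention that more negative scores mean stronger dependency; both the data and that convention are assumptions, and the helper `pearson` is illustrative rather than part of BEACON.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Synthetic values for one hypothetical gene across five cell lines.
expression = [1.0, 2.0, 3.0, 4.0, 5.0]
# Assumed convention: more negative score = cell line depends more on the gene.
dependency = [-0.1, -0.2, -0.6, -0.9, -1.2]

r = pearson(expression, dependency)
# A strongly negative r would flag the gene as a candidate
# expression-driven dependency under the assumed sign convention.
is_candidate = r < -0.5  # threshold chosen only for illustration
```

A correlation of this kind ignores lineage structure and uncertainty, which is presumably part of what a Bayesian formulation is meant to handle.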
Pub Date: 2026-01-24. DOI: 10.1093/gigascience/giag010
A M B Amorim, C Marques-Pereira, T Almeida, N Rosário-Ferreira, H S Pinto, C Vaz, A Francisco, I S Moreira
Background: The development of a single therapeutic compound can exceed 1.8 billion USD and take more than a decade, underscoring the urgent need to accelerate drug discovery. Computational methods have become indispensable; however, traditional approaches, such as docking simulations, face limitations because they depend on protein and ligand structures that may be unavailable, incomplete, or of low accuracy. Even recent breakthroughs, such as AlphaFold, do not consistently provide models precise enough to identify ligand-binding sites or drug-target interactions.
Results: We present ViralBindPredict, a deep learning framework that predicts viral protein-ligand binding sites directly from sequence. We also introduce the first curated large-scale benchmark of viral protein-ligand interactions, comprising >10,000 viral chains and ≈13,000 interactions processed using a 4.5 Å heavy-atom contact threshold. ViralBindPredict combines Mordred ligand descriptors with contextual protein embeddings from ESM-2 or ProtTrans, enabling structure-free learning of binding preferences. Leakage-controlled data splits were applied to prevent overlap across protein sequence clusters and ligand scaffolds (Cluster90%, NoRed90%→Cluster90%, Cluster40%, NoRed90%→Cluster40%). Across most regimes, multilayer perceptrons, especially with ESM-2 embeddings, outperformed LightGBM baselines, maintaining strong precision-recall for unseen ligands but showing larger drops for unseen proteins, indicating that the protein context dominates generalization.
Conclusions: ViralBindPredict introduces the first leakage-controlled benchmark for viral protein-ligand interactions and demonstrates accurate ligand-binding residue prediction directly from protein sequence. Together, these advances establish ViralBindPredict as a robust and extensible workflow for sequence-based antiviral discovery, supporting rapid target prioritization, compound repurposing, and de novo drug design, even in the absence of structural data.
Title: ViralBindPredict: Empowering Viral Protein-Ligand Binding Sites through Deep Learning and Protein Sequence-Derived Insights. Journal: GigaScience. Publication type: Journal Article.
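The 4.5 Å heavy-atom contact threshold mentioned in the abstract can be sketched as a labeling rule: a residue counts as ligand-binding if any of its heavy atoms lies within the cutoff of any ligand heavy atom. The code below is a minimal illustration under assumptions: the input layout (residue ID plus coordinates), the function name `binding_residues`, the inclusive comparison (`<=`), and the toy coordinates are all hypothetical and not taken from the ViralBindPredict pipeline.

```python
import math

CONTACT_CUTOFF = 4.5  # angstroms, as stated in the abstract


def binding_residues(protein_atoms, ligand_atoms, cutoff=CONTACT_CUTOFF):
    """Return sorted residue IDs with a heavy atom within cutoff of the ligand.

    protein_atoms: list of (residue_id, (x, y, z)) for protein heavy atoms.
    ligand_atoms: list of (x, y, z) for ligand heavy atoms.
    """
    binding = set()
    for residue_id, coord in protein_atoms:
        if residue_id in binding:
            continue  # residue already labeled by another of its atoms
        if any(math.dist(coord, lig) <= cutoff for lig in ligand_atoms):
            binding.add(residue_id)
    return sorted(binding)


# Toy coordinates (synthetic), in angstroms.
protein_atoms = [
    (10, (0.0, 0.0, 0.0)),   # 4.0 from the ligand atom: contact
    (10, (1.5, 0.0, 0.0)),
    (11, (10.0, 0.0, 0.0)),  # 6.0 away: no contact
    (12, (4.0, 5.0, 0.0)),   # 5.0 away: no contact
    (13, (4.0, 3.0, 2.0)),   # about 3.6 away: contact
]
ligand_atoms = [(4.0, 0.0, 0.0)]

labels = binding_residues(protein_atoms, ligand_atoms)
```

Per-residue labels of this kind are what a sequence-based model would then be trained to predict without access to the coordinates.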
Background: Genetic colocalization analysis is essential for understanding the shared genetic basis between phenotypic traits. Such an analysis is particularly useful for identifying plasma proteins with potential as therapeutic targets or clinical biomarkers. Improvements to existing tools are needed for more accurate inference of potentially causal biomarkers.
Findings: We develop HDL-C, a high-definition likelihood inference method for genetic colocalization analysis. Based on simulations and observed rediscovery rates in real data analyses, we demonstrate that the HDL-C approach outperforms state-of-the-art methods, COLOC, SuSiE, and SharePro, in detecting genetic colocalization, thus enabling a more complete understanding of genetic connections at specific loci. Analyses of the top 50 protein-disease pairs identified by HDL-C in the male and female cohorts of the UK Biobank uncovered 40 previously validated drug-protein-disease combinations with approved drugs matching the phenotypes and 62 combinations with potential drug repurposing opportunities. Additionally, we identified 63 novel protein-disease pairs that suggest promising candidates for future therapeutic interventions.
Conclusion: This research establishes a robust framework for detecting genetic colocalization signals, enabling the prioritization of disease-relevant protein targets and informing therapeutic development strategies.
Title: High-definition likelihood inference of genetic colocalization reveals protein biomarkers for human complex diseases. Authors: Yuying Li, Ranran Zhai, Zhijian Yang, Ting Li, Yudi Pawitan, Xia Shen. Journal: GigaScience. Pub Date: 2026-01-23. DOI: 10.1093/gigascience/giaf155. Publication type: Journal Article.
Pub Date: 2026-01-21. DOI: 10.1093/gigascience/giaf148
Wei Zhang, Hanchen Huang, Lily Wang, Brian D Lehmann, X Steven Chen
Background: High-throughput technologies now produce a wide array of omics data, from genomic and transcriptomic profiles to epigenomic and proteomic measurements. Integrating multiple omics layers measured on the same samples can reveal cross-layer molecular hubs that single-layer analyses miss. However, many existing integrative methods rely on linear assumptions or univariate feature importance, limiting their ability to capture nonlinear and interaction-driven dependencies across data modalities.
Results: We present an unsupervised, multivariate random forest (MRF) framework with an inverse minimal depth (IMD) importance to prioritize shared biomarkers across omics. In each forest, one layer serves as a multivariate response and the other as predictors; IMD summarizes how early a predictor (or response maximal splitting response variable) appears across trees, yielding interpretable, cross-layer feature rankings. We provide two IMD-based selection strategies and introduce an optional IMD power transform to enhance sensitivity to interaction signals. In extensive simulations spanning linear, nonlinear, and interaction regimes, our method matches sparse partial least squares/canonical correlation analysis under linear settings and outperforms them as nonlinearity increases, while adapted univariate ensemble learners (random forest, gradient boosting machine, XGBoost) underperform in the multivariate, unsupervised context. Applied to breast invasive carcinoma and colon adenocarcinoma in The Cancer Genome Atlas (TCGA), MRF-IMD identifies genes, CpGs, and microRNAs enriched for cancer-relevant pathways and yields more robust survival stratification than linear integrators with matched model sizes. In a TCGA pan-cancer analysis, MRF-IMD features achieve a higher Adjusted Rand Index than alternatives and recover coherent tumor-type clusters; in the Alzheimer's Disease Neuroimaging Initiative (ADNI), the integrative signature improves dementia progression stratification over a published methylation risk score.
Conclusions: MRF-IMD provides a scalable and interpretable framework for multiomics integration that reliably identifies cross-layer biomarkers when nonlinear and interaction-driven dependencies are present. This approach advances robust biomarker discovery beyond the limits of linear integrative methods.
Title: An integrative multiomics random forest framework for robust biomarker discovery. Journal: GigaScience. Publication type: Journal Article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12821379/pdf/
Pub Date: 2026-01-21 DOI: 10.1093/gigascience/giaf163
Yanming Wei, Zhaoyang Huang, Pinglu Zhang, Yizheng Wang, Yan Li, Liang Yu, Quan Zou
Background: Multiple sequence alignment (MSA) remains a central challenge in comparative genomics, where alignment quality largely determines the accuracy of downstream analyses. Alignment at large scale is still especially difficult.
Findings: This article introduces deMEM, a novel and effective framework for DNA multiple sequence alignment that enables existing MSA methods such as MAFFT to handle extremely large sequences. deMEM aligns in 3 stages: (i) representing maximal exact matches (MEMs) using a de Bruijn graph and clustering them based on their area, (ii) employing a novel divide-and-conquer framework for alignment, and (iii) providing profile-profile alignment between different clusters.
Conclusions: deMEM enables existing methods like MAFFT to align an extremely large number of sequences, including long sequences that cannot be aligned directly, such as those in a dataset of a thousand monkeypox virus genomes. The deMEM package is freely available at https://github.com/malabz/deMEM.
Title: deMEM: a novel divide-and-conquer framework based on de Bruijn graph for scalable multiple sequence alignment. Journal: GigaScience (impact factor 11.8). Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12878729/pdf/
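The divide-and-conquer idea in the deMEM abstract can be illustrated with a deliberately simplified sketch. It is not the deMEM algorithm: instead of maximal exact matches from a de Bruijn graph, it assumes plain k-mers that occur exactly once in every sequence as anchors, and the function names `shared_unique_kmers` and `split_at_anchor` are hypothetical. It shows only how one exact anchor lets a large alignment problem be cut into smaller left and right blocks that an external aligner such as MAFFT could then handle separately.

```python
from collections import Counter


def shared_unique_kmers(seqs, k):
    """Return the k-mers that occur exactly once in every sequence.

    Assumption for illustration: such k-mers stand in for the exact-match
    anchors that a real tool would derive from a de Bruijn graph.
    """
    shared = None
    for s in seqs:
        counts = Counter(s[i:i + k] for i in range(len(s) - k + 1))
        unique = {kmer for kmer, n in counts.items() if n == 1}
        shared = unique if shared is None else shared & unique
    return shared or set()


def split_at_anchor(seqs, anchor):
    """Split each sequence into (left, anchor, right) around one exact anchor."""
    parts = []
    for s in seqs:
        pos = s.find(anchor)  # anchor is unique in s, so find() is unambiguous
        parts.append((s[:pos], anchor, s[pos + len(anchor):]))
    return parts


# Toy sequences, invented for this example.
seqs = ["ACGTTGCAAT", "ACTTGCAGAT", "GACGTTGCAT"]
anchors = shared_unique_kmers(seqs, k=5)
anchor = sorted(anchors)[0]
blocks = split_at_anchor(seqs, anchor)

for left, mid, right in blocks:
    print(left, mid, right)
```

The left blocks and the right blocks are now two independent, smaller alignment problems; the anchor column is already aligned by construction. Choosing a colinear chain of many anchors and merging the block alignments is left out of this sketch.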
Pub Date: 2026-01-21 DOI: 10.1093/gigascience/giaf145
Joanna Szablińska-Piernik, Paweł Sulima, Jakub Sawicki
Background: The liverwort Apopellia endiviifolia, a dioicous, simple thalloid species, is notable for its cryptic diversity, habitat adaptability, and genomic innovation, and it represents a clade that is sister to all other Jungermanniopsida. These features make A. endiviifolia an essential model for exploring speciation mechanisms and the evolution of genome structures within liverworts.
Findings: We present the genome assembly of a haploid A. endiviifolia isolate with a total size of 2,914,960,273 bp and an N50 of 468,157,909 bp, demonstrating high completeness (99.2% BUSCO) and high consensus quality (quality value 47.6). The assembly consisted of 9 chromosomes, which included 18 telomeres and 9 centromeres (ranging from 1.9 to 5 Mbp in length). RNA sequencing-based annotation identified 34,615 genes, predominantly protein coding. The transposable elements comprised 12.16% long terminal repeat elements and 57 Helitrons. Among the retroelements, the Copia and Gypsy superfamilies comprised 8.94% and 2.95% of the genome, respectively. The Ty3/Gypsy superfamily was significantly enriched in centromeric regions. The average GC content ranged from 38.8% to 39.6%, with gene density varying between 5.52 and 9.78 genes per 500 kbp. Synteny analysis with related liverwort species revealed complex chromosomal relationships, indicating extensive genome rearrangements among species.
Conclusions: This study provides the first high-quality reference genome assembly of the haploid liverwort A. endiviifolia. The assembly and annotation offer valuable resources for investigating liverwort evolution, centromere biology, and genome expansion in simple thalloid liverworts.
Title: Giant chromosomes of a tiny plant - the complete telomere-to-telomere genome assembly of the simple thalloid liverwort Apopellia endiviifolia (Jungermanniopsida, Marchantiophyta). Journal: GigaScience (impact factor 11.8). Publication type: Journal Article.
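For readers unfamiliar with the N50 statistic quoted in the abstract above, the following minimal sketch shows the standard definition: the length of the sequence at which the running total, taken from longest to shortest, first reaches half of the assembly size. The function name `n50` and the example lengths are made up for illustration; they are not the chromosome lengths of the A. endiviifolia assembly.

```python
def n50(lengths):
    """Return the N50 of a list of contig, scaffold, or chromosome lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        # First sequence that brings the running total to at least half.
        if running * 2 >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical lengths in bp, for illustration only.
example_lengths = [80, 70, 50, 40, 30, 20]
print(n50(example_lengths))  # 70
```

Here the total is 290, so half is 145; the two longest sequences sum to 150, which makes the second one (70) the N50.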