Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf050
January Adams, Rafal Cymerys, Karol Szuster, Daniel Hekman, Zoryana Salo, Rutvik Solanki, Muhammad Mamdani, Alistair Johnson, Katarzyna Ryniak, Tom Pollard, David Rotenberg, Benjamin Haibe-Kains
We outline the development of the Health Data Nexus, a data platform that enables data storage and access management with a cloud-based computational environment. We describe the importance of this secure platform in an evolving public-sector research landscape that utilizes significant quantities of data, particularly clinical data acquired from health systems, as well as the importance of providing meaningful benefits for three targeted user groups: data providers, researchers, and educators. We then describe the implementation of governance practices, technical standards, and data security, and the privacy protections needed to build this platform, as well as example use-cases highlighting the strengths of the platform in facilitating dataset acquisition, novel research, and hosting educational courses, workshops, and datathons. Finally, we discuss the key principles that informed the platform's development, highlighting the importance of flexible uses, collaborative development, and open-source science.
Title: Health Data Nexus: an open data platform for AI research and education in medicine. Journal: GigaScience, vol. 14. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12131319/pdf/
Cultivated tomato (Solanum lycopersicum) is a major vegetable crop of high economic value that serves as an important model for studying flowering time in day-neutral plants. A complete, continuous, and gapless genome of cultivated tomato is essential for genetic research and breeding programs. Here, we report the construction of a telomere-to-telomere (T2T) gap-free genome of S. lycopersicum cv. VF36 using a combination of sequencing technologies. The 815.27-Mb T2T "VF36" genome contained 600.23 Mb of transposable elements. Through comparative genomics and phylogenetic analysis, we identified structural variations between the "VF36" and "Heinz 1706" genomes and found no evidence of a recent species-specific whole-genome duplication in the "VF36" tomato. Furthermore, we identified a core circadian oscillator, SlPRR1, whose expression followed a circadian rhythm and peaked at night. CRISPR/Cas9-mediated knockdown of SlPRR1 in tomatoes demonstrated that slprr1 mutant lines flowered significantly earlier than the wild type under long-day conditions. We present a hypothetical model of how SlPRR1 regulates flowering time and chlorophyll biosynthesis in response to photoperiod. This T2T genomic resource will accelerate the genetic improvement of large-fruited tomatoes, and the SlPRR1-related hypothetical model will enhance our understanding of the photoperiodic response in cultivated tomatoes, revealing a regulatory mechanism for manipulating flowering time.
Title: A telomere-to-telomere gapless genome reveals SlPRR1 control of circadian rhythm and photoperiodic flowering in tomato. Journal: GigaScience, vol. 14. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12218202/pdf/
Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf058
Hui Liu, Jia-Qi Zhang, Jian-Ping Tao, Chen Chen, Li-Yao Su, Jin-Song Xiong, Ai-Sheng Xiong
Background: Vertebrate sex is typically determined either by genetic factors, such as sex chromosomes, or by environmental cues like temperature. The agamid dragon lizard Pogona vitticeps is remarkable in this regard, as it exhibits both ZZ/ZW genetic and temperature-dependent sex determination. However, the complete sequence and full gene content of the P. vitticeps sex chromosomes remain unclear, hindering investigation of the sex-determining cascade in this model lizard.
Results: Using CycloneSEQ and DNBSEQ sequencing technologies, we generated a near-complete chromosome-scale genome assembly for a ZZ male P. vitticeps. Compared with the previous reference genome (GCF_900067755.1/Pvi1.1), the new ∼1.8-Gb assembly displayed a >5,700-fold improvement in contiguity (contig N50: 202.5 Mb vs. 35.5 kb) and achieved complete chromosome anchoring (16 vs. 13,749 scaffolds). We found that over 80% of the P. vitticeps Z chromosome remains a pseudo-autosomal region, where recombination is not suppressed. The sexually differentiated region (SDR) is small and occupied mostly by transposons, yet it aggregates genes involved in male development, such as AMH, AMHR2, and BMPR1A. Finally, by tracking the evolutionary origin and developmental expression of SDR genes, we proposed a model for the origin of P. vitticeps sex chromosomes in which the Z-linked AMH is the master sex-determining gene.
Conclusions: In this study, we fully characterized the Z sex chromosome of P. vitticeps, identified AMH as the candidate sex-determining gene, and proposed a new model for the origin of P. vitticeps sex chromosomes. The near-complete P. vitticeps reference genome will also benefit future studies of reptile evolution.
Title: A near-complete genome assembly of the bearded dragon Pogona vitticeps provides insights into the origin of Pogona sex chromosomes. Journal: GigaScience, vol. 14. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12360845/pdf/
Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf079
Qunfei Guo, Youliang Pan, Wei Dai, Fei Guo, Tao Zeng, Wanyi Chen, Yaping Mi, Yanshu Zhang, Shuaizhen Shi, Wei Jiang, Huimin Cai, Beiying Wu, Yang Zhou, Ying Wang, Chentao Yang, Xiao Shi, Xu Yan, Junyi Chen, Chongyang Cai, Jingnan Yang, Xun Xu, Ying Gu, Yuliang Dong, Qiye Li
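The contiguity figures quoted in the Pogona vitticeps abstract above rest on the contig N50 statistic. As a rough illustration only (this is not the authors' pipeline, and the toy contig lengths are invented for the example), N50 is the length of the contig at which the running sum of lengths, taken from longest to shortest, first reaches half of the total assembly size. The sketch also re-derives the reported fold improvement from the two N50 values given in the abstract.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half the total is the N50.
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


# Toy assembly (hypothetical lengths, for illustration only).
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
toy_n50 = n50(toy_contigs)

# Fold improvement from the two contig N50 values quoted in the abstract.
new_n50_bp = 202.5e6  # 202.5 Mb
old_n50_bp = 35.5e3   # 35.5 kb
fold_improvement = new_n50_bp / old_n50_bp

print(toy_n50, round(fold_improvement))
```

The ratio comes out slightly above 5,700, which is consistent with the ">5,700-fold" wording in the abstract.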
Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf109
Yichun Feng, Jiawei Wang, Ruikun He, Lu Zhou, Yixue Li
Background: Knowledge graphs and large language models (LLMs) are key tools for biomedical knowledge integration and reasoning, facilitating structured organization of scientific articles and discovery of complex semantic relationships. However, current methods face challenges: knowledge graph construction is limited by complex terminology, data heterogeneity, and rapid knowledge evolution, while LLMs show limitations in retrieval and reasoning, making it difficult to uncover cross-document associations and reasoning pathways.
Results: We propose a pipeline that uses LLMs to construct a Biomedical Stratified Knowledge Graph (BioStrataKG) from large-scale articles and builds the Biomedical Cross-Document Question Answering Dataset (BioCDQA) to evaluate latent knowledge retrieval and multihop reasoning. We then introduce Integrated and Progressive Retrieval-Augmented Reasoning (IP-RAR) to enhance retrieval accuracy and knowledge reasoning. IP-RAR maximizes information recall through integrated reasoning-based retrieval and refines knowledge via progressive reasoning-based generation, using self-reflection to achieve deep thinking and precise contextual understanding. Experiments show that IP-RAR improves document retrieval F1 score by 20% and answer generation accuracy by 25% over existing methods.
Conclusions: IP-RAR helps doctors efficiently integrate treatment evidence to inform the development of personalized medication plans and enables researchers to analyze advancements and research gaps, accelerating the hypothesis generation phase of scientific discovery and decision-making.
Title: A retrieval-augmented knowledge mining method with deep thinking LLMs for biomedical research and clinical support. Journal: GigaScience, vol. 14. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12448786/pdf/
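The IP-RAR abstract above reports its retrieval gains as a document retrieval F1 score. The abstract does not describe the evaluation protocol, so the following is only a minimal sketch of how a set-based retrieval F1 is commonly computed for a single question. The function name `retrieval_prf` and the document identifiers are hypothetical.

```python
def retrieval_prf(retrieved, relevant):
    """Return (precision, recall, F1) for one query, treating both inputs as sets of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: three documents retrieved, two of them relevant,
# and one relevant document missed.
precision, recall, f1 = retrieval_prf(["doc_a", "doc_b", "doc_c"], ["doc_b", "doc_c", "doc_d"])
print(precision, recall, f1)
```

In a cross-document setting such per-question scores would typically be averaged over the dataset; whether the paper uses macro or micro averaging is not stated in the abstract.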
Background: While cell-free DNA (cfDNA) is a promising biomarker for cancer diagnosis and monitoring, there is limited agreement on optimal cfDNA collection and extraction protocols as well as on analysis pipelines for the corresponding cfDNA sequencing data. In this article, we address the latter by studying the effect of various bioinformatics preprocessing choices on derived genetic and epigenetic cfDNA features and by studying how the observed feature differences influence the downstream task of distinguishing healthy from cancer cfDNA samples.
Results: Using low-pass whole-genome cfDNA sequencing data from 20 lung cancer and 20 healthy samples, we assessed the influence of various preprocessing settings (read trimming, filtering of secondary alignments, and choice of genome build) and of practices such as downsampling or selecting for short fragments on derived cfDNA features, including cfDNA fragment size, fragment end motifs, copy number alterations, and nucleosome footprints. Our results demonstrate that the analyzed features are robust to common preprocessing choices but exhibit variable sensitivity to sequencing coverage. Fragment length statistics and end motifs are the least affected by low coverage, whereas nucleosome footprint analysis is very sensitive to it. Our findings confirm that selecting for shorter fragments enhances cancer-specific signals but, by removing data, also reduces signals in general. Interestingly, we find that fragment end motif analysis benefits the most from in silico size selection. We also observe that the filtering of low-quality and secondary alignments and the choice of genome build result in slight improvements in cancer classification performance based on nucleosome coverage and copy number features.
Conclusions: Altogether, we conclude that cfDNA analysis is minimally affected by different bioinformatics preprocessing settings, but we describe some synergistic effects between analytical approaches, which can be leveraged to improve cancer detection.
Title: The effects of bioinformatics preprocessing on cell-free DNA fragment analysis. Journal: GigaScience. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12720587/pdf/
Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf139
Ivna Ivanković, Zsolt Balázs, Todor Gitchev, Cécile Trottet, Norbert Moldovan, Idris Bahce, Florent Mouliere, Michael Krauthammer
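Two of the cfDNA features discussed in the abstract above, fragment end motifs and in silico size selection, can be illustrated with a small sketch. This is not the authors' pipeline: the motif length `k=4`, the `max_length` cut-off, and the toy sequences are all assumptions made for the example, and a real analysis would read aligned fragments and take strand orientation into account.

```python
from collections import Counter


def end_motif_frequencies(fragments, k=4, max_length=None):
    """Relative frequency of the first k bases of each fragment.

    If max_length is given, longer fragments are discarded first
    (a crude form of in silico size selection).
    """
    counts = Counter()
    for seq in fragments:
        if max_length is not None and len(seq) > max_length:
            continue  # size selection: drop long fragments
        if len(seq) < k:
            continue  # too short to carry a full motif
        motif = seq[:k].upper()
        if set(motif) <= set("ACGT"):  # skip motifs containing ambiguous bases
            counts[motif] += 1
    total = sum(counts.values())
    return {motif: n / total for motif, n in counts.items()} if total else {}


# Toy fragments (hypothetical; real cfDNA fragments are far longer).
toy_fragments = ["CCCAGGT", "CCCATTTTTTTT", "ACGTA", "NNNNAC"]
all_sizes = end_motif_frequencies(toy_fragments)
short_only = end_motif_frequencies(toy_fragments, max_length=7)
print(all_sizes, short_only)
```

Comparing the two dictionaries shows how a size cut-off shifts the motif distribution while also shrinking the number of fragments that contribute to it, which mirrors the trade-off the abstract describes.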
Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf119
Caroline Howard, Amy Denton, Benjamin W Jackson, Adam Bates, Jessie Jay, Halyna Yatsenko, Priyanka Sethu Raman, Abitha Thomas, Graeme Oatley, Raquel Vionette do Amaral, Zeynep Ene Göktan, Juan Pablo Narváez Gómez, Isabelle Clayton Lucey, Elizabeth Sinclair, Michael A Quail, Mark Blaxter, Kerstin Howe, Mara K N Lawniczak
Since its inception in 2019, the Tree of Life programme at the Wellcome Sanger Institute has released high-quality, chromosomally resolved reference genome assemblies for over 2,000 species. Tree of Life has at its core multiple teams, each of which is responsible for key components of the "genome engine." One of these teams is the Tree of Life core laboratory, which is responsible for processing tissues across a wide range of species into high-quality, high molecular weight DNA and intact RNA, and preparing tissues for Hi-C. Here, we detail the different workflows we have developed to successfully process a wide variety of species, covering plants, fungi, chordates, protists, arthropods, meiofauna, and other metazoa. We summarise our success rates and describe how to best apply and combine the suite of current protocols, which are all publicly available at https://dx.doi.org/10.17504/protocols.io.8epv5xxy6g1b/v2.
Title: On the path to reference genomes for all biodiversity: laboratory protocols and lessons learned from processing over 2,000 species in the Sanger Tree of Life. Journal: GigaScience, vol. 14. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12548527/pdf/
Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf137
Xiuyun Liu, Fangfang Li, Marek Czosnyka, Zofia Czosnyka, Huijie Yu, Xiaoguang Tong, Yan Xing, Hongliang Li, Ke Pu, Keke Feng, Kuo Zhang, Meijun Pang, Dong Ming
Background: The world has witnessed a steady rise in neurological diseases, which represent a heterogeneous group of disorders characterized by complex pathogenesis involving disruptions at multiple molecular levels, including genomic, transcriptomic, proteomic, and metabolomic levels. These disorders, often caused by genetic mutations, metabolic imbalances, immune dysregulation, and environmental factors, pose significant challenges to global public health due to their high prevalence, mortality, and disability burden.
Results: The advent of high-throughput technologies, such as next-generation sequencing and mass spectrometry, has provided valuable insights into the underlying mechanisms of disease. In particular, the development of multi-omics and high-spatial-resolution omics technologies has made it possible to integrate multiple levels of biology and to analyze the complex molecular networks and pathophysiological processes involved.
Conclusions: This review provides a comprehensive analysis of the latest advancements in multi- and high-spatial-resolution omics, with a focus on their applications in precision diagnostics, biomarker discovery, and therapeutic target identification in brain diseases. The study also highlights the current challenges in the clinical implementation and discusses the future directions, with artificial intelligence being anticipated to enhance clinical translation and diagnostic accuracy significantly.
Title: Multi-omics and high-spatial-resolution omics: deciphering complexity in neurological disorders. Journal: GigaScience. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12723665/pdf/
Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf116
Thomas Barba, Bryce A Bagley, Sandra Steyaert, Francisco Carrillo-Perez, Christoph Sadée, Michael Iv, Olivier Gevaert
Background: Magnetic resonance imaging (MRI) of the brain contains complex data that pose significant challenges for computational analysis. While models proposed for brain MRI analyses yield encouraging results, the high complexity of neuroimaging data hinders generalizability and clinical application. We introduce DUNE, a neuroimaging-oriented workflow that transforms raw brain MRI scans into standardized compact patient-level embeddings through integrated preprocessing and deep feature extraction, thereby enabling their processing by basic machine learning algorithms. A UNet-based autoencoder was trained using 3,814 selected scans of morphologically normal (healthy volunteers) or abnormal (glioma patients) brains, to generate comprehensive compact representations of the full-sized images. To evaluate their quality, these embeddings were utilized to train machine learning models to predict a wide range of clinical variables.
Results: Embeddings were extracted for cohorts used for the model development (21,102 individuals), along with 3 additional independent cohorts (Alzheimer's disease, schizophrenia, and glioma cohorts, 1,322 individuals), to evaluate the model's generalization capabilities. The embeddings extracted from healthy volunteers' scans could predict a broad spectrum of clinical parameters, including volumetry metrics, cardiovascular disease (area under the receiver operating characteristic curve [AUROC] = 0.80) and alcohol consumption (AUROC = 0.99), and more nuanced parameters such as the Alzheimer's predisposing APOE4 allele (AUROC = 0.67). Embeddings derived from the validation cohorts successfully predicted the diagnoses of Alzheimer's dementia (AUROC = 0.92) and schizophrenia (AUROC = 0.64). Embeddings extracted from glioma scans successfully predicted survival (C-index = 0.608) and IDH molecular status (AUROC = 0.92), matching the performances of previous task-oriented models.
Conclusion: DUNE efficiently represents clinically relevant patterns from full-size brain MRI scans across several disease areas, opening ways for innovative clinical applications in neurology.
Title: DUNE: a versatile neuroimaging encoder captures brain complexity across 3 major diseases: cancer, dementia, and schizophrenia. Journal: GigaScience, vol. 14. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12527335/pdf/
Background: With the growing recognition of the important roles noncoding RNAs (ncRNAs) play in various biological functions, especially their potential involvement in many human diseases, predicting ncRNA-disease associations has become a key challenge in biomedical research.
Results: Although many computational methods have been proposed to predict ncRNA-disease associations, most focus on a single type of ncRNA. However, the competitive and cooperative interactions among different types of ncRNAs are closely related to their functional roles in disease associations. To address this limitation, we propose a novel computational framework, PanGIA (Pan-ncRNA Graph-Interaction Attention network), designed to simultaneously predict potential associations between diseases and multiple types of noncoding RNAs, including microRNAs (miRNAs), long noncoding RNAs (lncRNAs), circular RNAs (circRNAs), and PIWI-interacting RNAs (piRNAs). Experimental results show that PanGIA outperforms type-specific state-of-the-art (SOTA) methods across multiple metrics in both individual and comprehensive predictions. It remains robust even when nodes or ncRNA types are removed, and ablation studies confirm the benefits of cross-type information.
Conclusions: PanGIA demonstrates significant advantages in predicting disease associations for different types of ncRNAs, including miRNAs, lncRNAs, circRNAs, and piRNAs. Case studies further confirm the accuracy of the model's predictions, as all high-confidence associations were supported by literature evidence. This demonstrates the model's strong biological interpretability and promising potential for practical applications. The successful application of PanGIA provides a new paradigm for exploring disease-associated ncRNAs, highlighting their immense potential in the field of biomedical research.
{"title":"PanGIA: A universal framework for identifying association between ncRNAs and diseases.","authors":"Xiaoyuan Liu, Xiye Lü, Qiuhao Chen, Jiqiu Sun, Tianyi Zhao, Yan Zhu","doi":"10.1093/gigascience/giaf123","DOIUrl":"10.1093/gigascience/giaf123","url":null,"PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"14 ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12532321/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145307641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf144
Mahnaz Mohammadi, Christina Fell, David Morrison, Sarah Bell, Gareth Bryson, Sheeba Syed, Prakash Konanahalli, David Harris-Birtill, Ognjen Arandjelovic, Clare Orange, Prishma Shahi, In Hwa Um, James D Blackwood, David J Harrison
Background: The clinical pathway for the prevention and treatment of cervical cancer depends on cytology and then the assessment of biopsy specimens, fragments of tissue removed for histological examination. This assessment can create a significant workload and is an obvious exemplar for exploring triage based on machine learning analysis of slides. Limited access to large annotated datasets of human diseased tissue is a major obstacle to developing standards and algorithms that can assist diagnosis.
Results: We present a dataset comprising 2,539 whole-slide images of cervical biopsy specimens, each annotated by several pathologists, with consensus reached on the diagnosis and on individual features. Each whole-slide image represents 1 slide per patient in iSyntax format, with manual annotations by pathologists in JSON format. Each whole-slide image is assigned a category label, which is the final diagnosis of the image, and a subcategory label, which identifies the subcategory to which the image belongs.
Conclusion: This dataset has been used to build a model that accurately predicts diagnosis, allowing the possibility of automatically triaging biopsy specimens, so that the most significant pathologies can be identified rapidly and those patients selected for immediate treatment. The level of annotation, at the subslide level, and the number of cases are unique in public databases and should allow investigators to explore multiple aspects of computer vision relevant to human tissue diagnosis, with no limitation placed on access to the whole-slide images.
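The two-level labelling described in the Results (one category and one subcategory per whole-slide image) can be pictured as a simple record per slide. The sketch below is illustrative only: the class name `SlideLabel`, its field names, and the placeholder label values are assumptions, and the abstract does not specify the dataset's actual annotation schema or label vocabulary.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SlideLabel:
    """Hypothetical per-slide record; field names are assumptions."""
    slide_id: str      # identifier of the whole-slide image
    category: str      # final diagnosis assigned to the slide
    subcategory: str   # finer-grained label within that category


def count_by_category(labels: Iterable[SlideLabel]) -> Counter:
    """Tally slides per category, e.g. to inspect class balance."""
    return Counter(label.category for label in labels)


# Placeholder values (assumption), not real labels from the dataset.
toy_labels = [
    SlideLabel("slide_001", "category_a", "subcategory_a1"),
    SlideLabel("slide_002", "category_a", "subcategory_a2"),
    SlideLabel("slide_003", "category_b", "subcategory_b1"),
]
toy_counts = count_by_category(toy_labels)
```

Keeping the category and subcategory as separate fields lets the same records serve either a coarse triage task or a finer multiclass task without relabelling.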
{"title":"Cervical whole-slide images dataset for multiclass classification.","authors":"Mahnaz Mohammadi, Christina Fell, David Morrison, Sarah Bell, Gareth Bryson, Sheeba Syed, Prakash Konanahalli, David Harris-Birtill, Ognjen Arandjelovic, Clare Orange, Prishma Shahi, In Hwa Um, James D Blackwood, David J Harrison","doi":"10.1093/gigascience/giaf144","DOIUrl":"10.1093/gigascience/giaf144","url":null,"PeriodicalId":12581,"journal":{"name":"GigaScience","volume":" ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12751090/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145632199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}