GigaScience: Latest Articles

Health Data Nexus: an open data platform for AI research and education in medicine.
IF 11.8 Zone 2 Biology Q1 MULTIDISCIPLINARY SCIENCES Pub Date: 2025-01-06 DOI: 10.1093/gigascience/giaf050
January Adams, Rafal Cymerys, Karol Szuster, Daniel Hekman, Zoryana Salo, Rutvik Solanki, Muhammad Mamdani, Alistair Johnson, Katarzyna Ryniak, Tom Pollard, David Rotenberg, Benjamin Haibe-Kains

We outline the development of the Health Data Nexus, a data platform that enables data storage and access management with a cloud-based computational environment. We describe the importance of this secure platform in an evolving public-sector research landscape that utilizes significant quantities of data, particularly clinical data acquired from health systems, as well as the importance of providing meaningful benefits for three targeted user groups: data providers, researchers, and educators. We then describe the implementation of governance practices, technical standards, and data security, and the privacy protections needed to build this platform, as well as example use-cases highlighting the strengths of the platform in facilitating dataset acquisition, novel research, and hosting educational courses, workshops, and datathons. Finally, we discuss the key principles that informed the platform's development, highlighting the importance of flexible uses, collaborative development, and open-source science.

Citations: 0
A telomere-to-telomere gapless genome reveals SlPRR1 control of circadian rhythm and photoperiodic flowering in tomato.
IF 11.8 Zone 2 Biology Q1 MULTIDISCIPLINARY SCIENCES Pub Date: 2025-01-06 DOI: 10.1093/gigascience/giaf058
Hui Liu, Jia-Qi Zhang, Jian-Ping Tao, Chen Chen, Li-Yao Su, Jin-Song Xiong, Ai-Sheng Xiong

Cultivated tomato (Solanum lycopersicum) is a major vegetable crop of high economic value that serves as an important model for studying flowering time in day-neutral plants. A complete, continuous, and gapless genome of cultivated tomato is essential for genetic research and breeding programs. Here, we report the construction of a telomere-to-telomere (T2T) gap-free genome of S. lycopersicum cv. VF36 using a combination of sequencing technologies. The 815.27-Mb T2T "VF36" genome contained 600.23 Mb of transposable elements. Through comparative genomics and phylogenetic analysis, we identified structural variations between the "VF36" and "Heinz 1706" genomes and found no evidence of a recent species-specific whole-genome duplication in the "VF36" tomato. Furthermore, a core circadian oscillator, SlPRR1, was identified, which peaked at night in a circadian rhythm. CRISPR/Cas9-mediated knockdown of SlPRR1 in tomatoes demonstrated that slprr1 mutant lines exhibited significantly earlier flowering under long-day condition than wild type. We present a hypothetical model of how SlPRR1 regulates flowering time and chlorophyll biosynthesis in response to photoperiod. This T2T genomic resource will accelerate the genetic improvement of large-fruited tomatoes, and the SlPRR1-related hypothetical model will enhance our understanding of the photoperiodic response in cultivated tomatoes, revealing a regulatory mechanism for manipulating flowering time.
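
As a quick sense of scale, the two sizes quoted in the abstract imply the repeat fraction of the assembly. This is plain arithmetic on the reported figures, not a value stated by the authors:

```latex
\[
\frac{600.23\ \text{Mb}}{815.27\ \text{Mb}} \approx 0.736
\]
```

In other words, transposable elements account for roughly 74% of the "VF36" assembly.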

Citations: 0
A near-complete genome assembly of the bearded dragon Pogona vitticeps provides insights into the origin of Pogona sex chromosomes.
IF 11.8 Zone 2 Biology Q1 MULTIDISCIPLINARY SCIENCES Pub Date: 2025-01-06 DOI: 10.1093/gigascience/giaf079
Qunfei Guo, Youliang Pan, Wei Dai, Fei Guo, Tao Zeng, Wanyi Chen, Yaping Mi, Yanshu Zhang, Shuaizhen Shi, Wei Jiang, Huimin Cai, Beiying Wu, Yang Zhou, Ying Wang, Chentao Yang, Xiao Shi, Xu Yan, Junyi Chen, Chongyang Cai, Jingnan Yang, Xun Xu, Ying Gu, Yuliang Dong, Qiye Li

Background: Vertebrate sex is typically determined either by genetic factors, such as sex chromosomes, or by environmental cues like temperature. The agamid dragon lizard Pogona vitticeps is remarkable in this regard, as it exhibits both ZZ/ZW genetic and temperature-dependent sex determination. However, the complete sequence and full gene content of the P. vitticeps sex chromosomes remain unclear, hindering investigation of the sex-determining cascade in this model lizard.

Results: Using CycloneSEQ and DNBSEQ sequencing technologies, we generated a near-complete chromosome-scale genome assembly for a ZZ male P. vitticeps. Compared with the previous reference genome (GCF_900067755.1/Pvi1.1), this new ∼1.8-Gb assembly displayed a >5,700-fold improvement in contiguity (contig N50: 202.5 Mb vs. 35.5 kb) and achieved complete chromosome anchoring (16 vs. 13,749 scaffolds). We found that over 80% of the P. vitticeps Z chromosome remains as a pseudo-autosomal region, where recombination is not suppressed. The sexually differentiated region (SDR) is small and occupied mostly by transposons, yet it aggregates genes involved in male development, such as AMH, AMHR2, and BMPR1A. Finally, by tracking the evolutionary origin and developmental expression of SDR genes, we proposed a model for the origin of P. vitticeps sex chromosomes that considered the Z-linked AMH as the master sex-determining gene.

Conclusions: In this study, we fully characterized the Z sex chromosome of P. vitticeps, identified AMH as the candidate sex-determining gene, and proposed a new model for the origin of P. vitticeps sex chromosomes. The near-complete P. vitticeps reference genome will also benefit future study of reptile evolution.
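
The contiguity comparison in this abstract rests on the contig N50 statistic. The sketch below shows one common way to compute N50 from a list of contig lengths, then checks the fold improvement implied by the two reported values. The function name `contig_n50` and the toy lengths are illustrative assumptions; only the 202.5 Mb and 35.5 kb figures come from the abstract.

```python
def contig_n50(lengths):
    """Return the N50: the largest length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Toy example with made-up contig lengths (not data from the paper).
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(contig_n50(toy_contigs))  # 8

# Fold improvement implied by the contig N50 values reported in the abstract.
new_n50_bp = 202.5e6  # 202.5 Mb
old_n50_bp = 35.5e3   # 35.5 kb
fold = new_n50_bp / old_n50_bp
print(round(fold))  # about 5704, consistent with ">5,700-fold"
```

The ratio of the two reported N50 values is about 5,704, which matches the ">5,700-fold" claim.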

Citations: 0
A retrieval-augmented knowledge mining method with deep thinking LLMs for biomedical research and clinical support.
IF 11.8 Zone 2 Biology Q1 MULTIDISCIPLINARY SCIENCES Pub Date: 2025-01-06 DOI: 10.1093/gigascience/giaf109
Yichun Feng, Jiawei Wang, Ruikun He, Lu Zhou, Yixue Li

Background: Knowledge graphs and large language models (LLMs) are key tools for biomedical knowledge integration and reasoning, facilitating structured organization of scientific articles and discovery of complex semantic relationships. However, current methods face challenges: knowledge graph construction is limited by complex terminology, data heterogeneity, and rapid knowledge evolution, while LLMs show limitations in retrieval and reasoning, making it difficult to uncover cross-document associations and reasoning pathways.

Results: We propose a pipeline that uses LLMs to construct a Biomedical Stratified Knowledge Graph (BioStrataKG) from large-scale articles and builds the Biomedical Cross-Document Question Answering Dataset (BioCDQA) to evaluate latent knowledge retrieval and multihop reasoning. We then introduce Integrated and Progressive Retrieval-Augmented Reasoning (IP-RAR) to enhance retrieval accuracy and knowledge reasoning. IP-RAR maximizes information recall through integrated reasoning-based retrieval and refines knowledge via progressive reasoning-based generation, using self-reflection to achieve deep thinking and precise contextual understanding. Experiments show that IP-RAR improves document retrieval F1 score by 20% and answer generation accuracy by 25% over existing methods.

Conclusions: The IP-RAR helps doctors efficiently integrate treatment evidence to inform the development of personalized medication plans and enables researchers to analyze advancements and research gaps, accelerating the hypothesis generation phase of scientific discovery and decision-making.
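
The retrieval result above is reported as an F1 score. As a reminder of what that metric measures, here is a minimal sketch of set-based F1 for document retrieval. The function name `retrieval_f1` and the document identifiers are hypothetical, and the abstract does not say whether the 20% gain is absolute or relative, so the sketch makes no claim about IP-RAR's actual evaluation protocol.

```python
def retrieval_f1(retrieved, relevant):
    """Set-based F1: harmonic mean of precision and recall over document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    if true_positives == 0:
        return 0.0  # also covers empty retrieved or relevant sets
    precision = true_positives / len(retrieved)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 4 documents retrieved, 3 truly relevant, 2 in common.
retrieved_docs = ["doc1", "doc2", "doc3", "doc4"]
relevant_docs = ["doc2", "doc3", "doc5"]
score = retrieval_f1(retrieved_docs, relevant_docs)
print(f"{score:.3f}")  # 0.571 (precision 0.5, recall 2/3)
```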

Citations: 0
The effects of bioinformatics preprocessing on cell-free DNA fragment analysis.
IF 11.8 Zone 2 Biology Q1 MULTIDISCIPLINARY SCIENCES Pub Date: 2025-01-06 DOI: 10.1093/gigascience/giaf139
Ivna Ivanković, Zsolt Balázs, Todor Gitchev, Cécile Trottet, Norbert Moldovan, Idris Bahce, Florent Mouliere, Michael Krauthammer

Background: While cell-free DNA (cfDNA) is a promising biomarker for cancer diagnosis and monitoring, there is limited agreement on optimal cfDNA collection and extraction protocols as well as analysis pipelines of the corresponding cfDNA sequencing data. In this article, we address the latter by studying the effect of various bioinformatics preprocessing choices on derived genetic and epigenetic cfDNA features and study how observed feature differences influence the downstream task of separating between healthy and cancer cfDNA samples.

Results: Using low-pass whole-genome cfDNA sequencing data from 20 lung cancer and 20 healthy samples, we assessed the influence of various preprocessing settings, such as read trimming, filtering of secondary alignments, and choice of genome build, as well as practices such as downsampling or selecting for short fragments, on derived cfDNA features, including cfDNA fragment size, fragment end motifs, copy number alterations, and nucleosome footprints. Our results demonstrate that the analyzed features are robust to common preprocessing choices but exhibit variable sensitivity to sequencing coverage. Fragment length statistics and end motifs are the least affected by low coverage, whereas nucleosome footprint analysis is very sensitive to it. Our findings confirm that selecting for shorter fragments enhances cancer-specific signals but, by removing data, also reduces signals in general. Interestingly, we find that fragment end motif analysis benefits the most from in silico size selection. We also observe that the filtering of low-quality and secondary alignments and choice of genome build result in slight improvements in cancer classification performance based on nucleosome coverage and copy number features.

Conclusions: Altogether, we conclude that cfDNA analysis is minimally affected by different bioinformatics preprocessing settings, but we describe some synergistic effects between analytical approaches, which can be leveraged to improve cancer detection.
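
Two of the steps named in this abstract, in silico size selection and fragment end motif analysis, can be illustrated on toy data. The sketch below is a simplified assumption, not the authors' pipeline: fragments are plain sequence strings rather than aligned reads, the 90 to 150 bp window and the 4-mer motif length are illustrative defaults, and the helper names are hypothetical. A real workflow would typically derive fragment lengths and end sequences from aligned BAM records and the reference genome.

```python
from collections import Counter


def select_short_fragments(fragments, min_len=90, max_len=150):
    """In silico size selection: keep fragments within an assumed length window."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


def end_motif_frequencies(fragments, k=4):
    """Relative frequency of the first k bases (5' end motif) of each fragment."""
    counts = Counter(f[:k].upper() for f in fragments if len(f) >= k)
    total = sum(counts.values())
    return {motif: n / total for motif, n in counts.items()} if total else {}


# Toy fragments (synthetic sequences, lengths 100, 200 and 120 bp).
fragments = [
    "CCCA" + "A" * 96,
    "CCCA" + "T" * 196,
    "GGTA" + "C" * 116,
]

short = select_short_fragments(fragments)
print(len(short))                        # 2: the 200 bp fragment is dropped
print(end_motif_frequencies(short))      # {'CCCA': 0.5, 'GGTA': 0.5}
```

Comparing motif frequencies before and after selection on the same sample shows how size selection shifts the motif profile while also shrinking the amount of data, which is the trade-off the abstract describes.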

Citations: 0
On the path to reference genomes for all biodiversity: laboratory protocols and lessons learned from processing over 2,000 species in the Sanger Tree of Life.
IF 11.8 Zone 2 Biology Q1 MULTIDISCIPLINARY SCIENCES Pub Date: 2025-01-06 DOI: 10.1093/gigascience/giaf119
Caroline Howard, Amy Denton, Benjamin W Jackson, Adam Bates, Jessie Jay, Halyna Yatsenko, Priyanka Sethu Raman, Abitha Thomas, Graeme Oatley, Raquel Vionette do Amaral, Zeynep Ene Göktan, Juan Pablo Narváez Gómez, Isabelle Clayton Lucey, Elizabeth Sinclair, Michael A Quail, Mark Blaxter, Kerstin Howe, Mara K N Lawniczak

Since its inception in 2019, the Tree of Life programme at the Wellcome Sanger Institute has released high-quality, chromosomally resolved reference genome assemblies for over 2,000 species. Tree of Life has at its core multiple teams, each of which is responsible for key components of the "genome engine." One of these teams is the Tree of Life core laboratory, which is responsible for processing tissues across a wide range of species into high-quality, high molecular weight DNA and intact RNA, and preparing tissues for Hi-C. Here, we detail the different workflows we have developed to successfully process a wide variety of species, covering plants, fungi, chordates, protists, arthropods, meiofauna, and other metazoa. We summarise our success rates and describe how to best apply and combine the suite of current protocols, which are all publicly available at https://dx.doi.org/10.17504/protocols.io.8epv5xxy6g1b/v2.

Citations: 0
Multi-omics and high-spatial-resolution omics: deciphering complexity in neurological disorders.
IF 11.8 Zone 2 Biology Q1 MULTIDISCIPLINARY SCIENCES Pub Date: 2025-01-06 DOI: 10.1093/gigascience/giaf137
Xiuyun Liu, Fangfang Li, Marek Czosnyka, Zofia Czosnyka, Huijie Yu, Xiaoguang Tong, Yan Xing, Hongliang Li, Ke Pu, Keke Feng, Kuo Zhang, Meijun Pang, Dong Ming

Background: The world has witnessed a steady rise in neurological diseases, which represent a heterogeneous group of disorders characterized by complex pathogenesis involving disruptions at multiple molecular levels, including genomic, transcriptomic, proteomic, and metabolomic levels. These disorders, often caused by genetic mutations, metabolic imbalances, immune dysregulation, and environmental factors, pose significant challenges to global public health due to their high prevalence, mortality, and disability burden.

Results: The advent of high-throughput technologies, such as next-generation sequencing and mass spectrometry, has provided valuable insights into the underlying mechanisms of disease, especially the development of multi- and high-spatial-resolution omics technologies, enabling the interaction of multiple levels of biology and analysis of the complex molecular networks and pathophysiological processes.

Conclusions: This review provides a comprehensive analysis of the latest advancements in multi- and high-spatial-resolution omics, with a focus on their applications in precision diagnostics, biomarker discovery, and therapeutic target identification in brain diseases. The study also highlights the current challenges in the clinical implementation and discusses the future directions, with artificial intelligence being anticipated to enhance clinical translation and diagnostic accuracy significantly.

Citations: 0
DUNE: a versatile neuroimaging encoder captures brain complexity across 3 major diseases: cancer, dementia, and schizophrenia.
IF 11.8 Zone 2 Biology Q1 MULTIDISCIPLINARY SCIENCES Pub Date: 2025-01-06 DOI: 10.1093/gigascience/giaf116
Thomas Barba, Bryce A Bagley, Sandra Steyaert, Francisco Carrillo-Perez, Christoph Sadée, Michael Iv, Olivier Gevaert

Background: Magnetic resonance imaging (MRI) of the brain contains complex data that pose significant challenges for computational analysis. While models proposed for brain MRI analyses yield encouraging results, the high complexity of neuroimaging data hinders generalizability and clinical application. We introduce DUNE, a neuroimaging-oriented workflow that transforms raw brain MRI scans into standardized compact patient-level embeddings through integrated preprocessing and deep feature extraction, thereby enabling their processing by basic machine learning algorithms. A UNet-based autoencoder was trained using 3,814 selected scans of morphologically normal (healthy volunteers) or abnormal (glioma patients) brains, to generate comprehensive compact representations of the full-sized images. To evaluate their quality, these embeddings were utilized to train machine learning models to predict a wide range of clinical variables.

Results: Embeddings were extracted for cohorts used for the model development (21,102 individuals), along with 3 additional independent cohorts (Alzheimer's disease, schizophrenia, and glioma cohorts, 1,322 individuals), to evaluate the model's generalization capabilities. The embeddings extracted from healthy volunteers' scans could predict a broad spectrum of clinical parameters, including volumetry metrics, cardiovascular disease (area under the receiver operating characteristic curve [AUROC] = 0.80) and alcohol consumption (AUROC = 0.99), and more nuanced parameters such as the Alzheimer's predisposing APOE4 allele (AUROC = 0.67). Embeddings derived from the validation cohorts successfully predicted the diagnoses of Alzheimer's dementia (AUROC = 0.92) and schizophrenia (AUROC = 0.64). Embeddings extracted from glioma scans successfully predicted survival (C-index = 0.608) and IDH molecular status (AUROC = 0.92), matching the performances of previous task-oriented models.

Conclusion: DUNE efficiently represents clinically relevant patterns from full-size brain MRI scans across several disease areas, opening ways for innovative clinical applications in neurology.

Citations: 0
PanGIA: A universal framework for identifying association between ncRNAs and diseases.
IF 11.8 Zone 2 Biology Q1 MULTIDISCIPLINARY SCIENCES Pub Date: 2025-01-06 DOI: 10.1093/gigascience/giaf123
Xiaoyuan Liu, Xiye Lü, Qiuhao Chen, Jiqiu Sun, Tianyi Zhao, Yan Zhu

Background: With the growing recognition of the important roles noncoding RNAs (ncRNAs) play in various biological functions, especially their potential involvement in many human diseases, predicting ncRNA-disease associations has become a key challenge in biomedical research.

Results: Although many computational methods have been proposed to predict ncRNA-disease associations, most of these methods focus on a single type of ncRNA. However, the competitive and cooperative interactions among different types of ncRNAs are closely related to their functional roles in disease associations. To address this limitation, we propose a novel computational framework, PanGIA (Pan-ncRNA Graph-Interaction Attention network), designed to simultaneously predict potential associations between multiple types of noncoding RNAs, including microRNAs (miRNAs), long noncoding RNAs (lncRNAs), circular RNAs (circRNAs), and PIWI-interacting RNAs (piRNAs), and diseases. Experimental results show that PanGIA outperforms type-specific SOTA methods in both individual and comprehensive predictions. It remains robust even when nodes or ncRNA types are removed, and ablation studies confirm the benefits of cross-type information. PanGIA also outperforms several single-type state-of-the-art methods across multiple metrics.

Conclusions: PanGIA demonstrates significant advantages in predicting disease associations for different types of ncRNAs, including miRNAs, lncRNAs, circRNAs, and piRNAs. Case studies further confirm the accuracy of the model's predictions, as all high-confidence associations were supported by literature evidence. This demonstrates the model's strong biological interpretability and promising potential for practical applications. The successful application of PanGIA provides a new paradigm for exploring disease-associated ncRNAs, highlighting their immense potential in the field of biomedical research.

Citations: 0
Cervical whole-slide images dataset for multiclass classification.
IF 11.8 Zone 2 Biology Q1 MULTIDISCIPLINARY SCIENCES Pub Date: 2025-01-06 DOI: 10.1093/gigascience/giaf144
Mahnaz Mohammadi, Christina Fell, David Morrison, Sarah Bell, Gareth Bryson, Sheeba Syed, Prakash Konanahalli, David Harris-Birtill, Ognjen Arandjelovic, Clare Orange, Prishma Shahi, In Hwa Um, James D Blackwood, David J Harrison

Background: The clinical pathway for the prevention and treatment of cervical cancer depends on cytology and then the assessment of biopsy specimens, fragments of tissue removed for histological examination. This can be a significant workload and is an obvious exemplar to explore triage based on machine learning analysis of slides. Limited access to large annotated datasets of human diseased tissue is a major obstacle to developing standards and algorithms that can assist diagnosis.

Results: We present a dataset comprising 2,539 whole-slide images of cervical biopsy specimens, each annotated by several pathologists, with consensus agreed on diagnosis and individual features. Each whole-slide image represents 1 slide per patient in iSyntax format, with manual annotations by pathologists in JSON format. Each whole-slide image is assigned a category label, which is the final diagnosis of the image, and a subcategory label, which declares in which subcategory the image is found.

Conclusion: This dataset has been used to build a model that accurately predicts diagnosis, allowing the possibility of automatically triaging biopsy specimens, so that the most significant pathologies can be identified rapidly and those patients selected for immediate treatment. The level of annotation, at the subslide level, and the number of cases are unique in public databases and should allow investigators to explore multiple aspects of computer vision relevant to human tissue diagnosis, with no limitation placed on access to the whole-slide images.

Citations: 0