
GigaScience: Latest Articles

Open RGB Imaging Workflow for Morphological and Morphometric Analysis of Fruits using Deep Learning: A Case Study on Almonds.
IF 11.8, Zone 2 (Biology), Q1 MULTIDISCIPLINARY SCIENCES, Pub Date: 2025-12-19, DOI: 10.1093/gigascience/giaf157
Jorge Mas-Gómez, Manuel Rubio, Federico Dicenta, Pedro José Martínez-García

Background: High-throughput phenotyping is addressing the current bottleneck in phenotyping within breeding programs. Imaging tools are becoming the primary resource for improving the efficiency of phenotyping processes and providing large datasets for genomic selection approaches. The advent of AI brings new advantages by enhancing phenotyping methods using imaging, making them more accessible to breeding programs. In this context, we have developed an open Python workflow for analyzing morphology, colour and morphometric traits using AI, which can be applied to fruits and other plant organs.

Results: The workflow was implemented in almond (Prunus dulcis (Mill.) D. A. Webb), a species where breeding efficiency is critical due to its long breeding cycle. Over 25,000 kernels, more than 20,000 nuts, and over 600 individuals were phenotyped, making this the largest morphological study conducted in almond so far. The best segmentation and reconstruction approaches achieved error rates below 1%. Weight and area variables enabled accurate estimation of kernel thickness, with a root mean squared error (RMSE) of 0.47. Fifty-five heritable morphological, morphometric and colour traits were identified, highlighting their potential as target traits in breeding programs.

Conclusion: The proposed workflow demonstrated robust performance across diverse datasets and was effective with limited training data for fine-tuning. Its compatibility with the output of AI-based labelling tools allows users to fully leverage the advantages of these technologies: reducing manual effort, accelerating dataset preparation, and streamlining the fine-tuning of segmentation models. This flexibility enhances the scalability and practical applicability of the workflow in real-world phenotyping scenarios, especially in the context of breeding programs.
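
The Results above report a root mean squared error (RMSE) for the kernel-thickness estimate. As a reminder of what that metric measures, here is a minimal sketch. The thickness values are hypothetical and are not data from the study, and the abstract does not state the unit of its RMSE.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical measured vs. estimated thickness values (illustration only).
observed = [6.0, 7.0, 8.0, 9.0]
predicted = [6.5, 6.5, 8.5, 8.5]
print(rmse(observed, predicted))  # every error is 0.5, so the RMSE is 0.5
```

Because the errors are squared before averaging, RMSE penalizes a few large misses more heavily than many small ones.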

Citations: 0
The genomes of five mantises provide insights into sex chromosome evolution and Mantodea phylogeny clarification.
IF 11.8, Zone 2 (Biology), Q1 MULTIDISCIPLINARY SCIENCES, Pub Date: 2025-12-18, DOI: 10.1093/gigascience/giaf158
Hangwei Liu, Lihong Lei, Fan Jiang, Bo Zhang, Hengchao Wang, Yutong Zhang, Hanbo Zhao, Guirong Wang, Wei Fan

Background: Praying mantises, members of the order Mantodea, play important roles in agriculture, medicine, bionics, and entertainment. However, the scarcity of genomic resources has hindered extensive studies on mantis evolution and behaviour.

Results: Here, we present the chromosome-scale reference genomes of five mantis species: the European mantis (Mantis religiosa), Chinese mantis (Tenodera sinensis), triangle dead leaf mantis (Deroplatys truncata), orchid mantis (Hymenopus coronatus), and metallic mantis (Metallyticus violacea). The assembled genome sizes range from ∼2.3 to 4.2 Gb, with contig N50 sizes of 1-109 Mb and 85-99% of the sequence anchored to chromosomes. The number of annotated protein-coding genes ranges from 17,804 to 19,017, with BUSCO completeness of 96.7-98.4%. We found that transposable element expansion is the major force governing genome size in Mantodea, and we suggest that translocations between the X chromosome and an autosome have occurred in the lineage of the family Mantidae. In addition, we found that the lineage of M. violacea has accumulated fewer substitutions than the lineages of other mantises. Furthermore, our genome-wide analyses showed that D. truncata is sister to H. coronatus rather than to M. religiosa and T. sinensis, which helps resolve the phylogenetic controversies surrounding the genus Deroplatys.

Conclusions: The high-quality genome assemblies of the five mantises provide a valuable resource for evolution studies of Mantodea and genetic improvement and breeding of beneficial biological control agents.
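
The Results above summarize assembly contiguity with contig N50. For readers unfamiliar with the statistic, the sketch below computes it under the usual definition: the length L such that contigs of length at least L together cover at least half of the assembly. The contig lengths are hypothetical and are not taken from the mantis assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half the assembly size.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

# Hypothetical contig lengths (arbitrary units), total 300.
contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(contigs))  # 80 + 70 = 150 reaches half of 300, so N50 is 70
```

A wide N50 range such as the one reported in the abstract indicates that contiguity differs substantially between the five assemblies.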

Citations: 0
LinkML: An Open Data Modeling Framework.
IF 11.8, Zone 2 (Biology), Q1 MULTIDISCIPLINARY SCIENCES, Pub Date: 2025-12-12, DOI: 10.1093/gigascience/giaf152
Sierra A T Moxon, Harold Solbrig, Nomi L Harris, Patrick Kalita, Mark A Miller, Sujay Patil, Kevin Schaper, Chris Bizon, J Harry Caufield, Silvano Cirujano Cuesta, Corey Cox, Frank Dekervel, Damion M Dooley, William D Duncan, Tim Fliss, Sarah Gehrke, Adam S L Graefe, Harshad Hegde, A J Ireland, Julius O B Jacobsen, Madan Krishnamurthy, Carlo Kroll, David Linke, Ryan Ly, Nicolas Matentzoglu, James A Overton, Jonny L Saunders, Deepak R Unni, Gaurav Vaidya, Wouter-Michiel A M Vierdag, Oliver Ruebel, Christopher G Chute, Matthew H Brush, Melissa A Haendel, Christopher J Mungall

Background: Scientific research relies on well-structured, standardized data; however, much of it is stored in formats such as free-text lab notebooks, non-standardized spreadsheets, or data repositories. This lack of structure challenges interoperability, making data integration, validation, and reuse difficult.

Findings: LinkML (Linked Data Modeling Language) is an open framework that simplifies the process of authoring, validating, and sharing data. LinkML can describe a range of data structures, from flat, list-based models to complex, interrelated, and normalized models that utilize polymorphism and compound inheritance. It offers an approachable syntax that is not tied to any one technical architecture and can be integrated seamlessly with many existing frameworks. The LinkML syntax provides a standard way to describe schemas, classes, and relationships, allowing modelers to build well-defined, stable, and optionally ontology-aligned data structures. Once defined, LinkML schemas may be imported into other LinkML schemas. These key features make LinkML an accessible platform for interdisciplinary collaboration and a reliable way to define and share data semantics.

Conclusions: LinkML helps reduce heterogeneity, complexity, and the proliferation of single-use data models while simultaneously enabling compliance with FAIR data standards. LinkML has seen increasing adoption in various fields, including biology, chemistry, biomedicine, microbiome research, finance, electrical engineering, transportation, and commercial software development. In short, LinkML makes implicit models explicitly computable and allows data to be standardized at its origin. LinkML documentation and code are available at linkml.io.
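
The Findings above describe schemas built from classes, relationships, and inheritance. The sketch below illustrates that idea only. It uses a plain Python dict whose keys (`classes`, `is_a`, `attributes`, `required`) follow the general shape of a LinkML schema as commonly written in YAML, but the class names are hypothetical and the validator is a hand-rolled stand-in, not the LinkML toolkit or its API.

```python
# Hypothetical schema for illustration; real LinkML schemas are authored in YAML.
schema = {
    "classes": {
        "NamedThing": {
            "attributes": {
                "id": {"required": True},
                "name": {},
            },
        },
        "Sample": {
            "is_a": "NamedThing",  # Sample inherits id and name
            "attributes": {
                "collection_date": {"required": True},
            },
        },
    },
}

def resolved_attributes(schema, class_name):
    """Collect a class's attributes, including those inherited via is_a."""
    cls = schema["classes"][class_name]
    parent = cls.get("is_a")
    attrs = dict(resolved_attributes(schema, parent)) if parent else {}
    attrs.update(cls.get("attributes", {}))
    return attrs

def validate(schema, class_name, record):
    """Return a list of problems; an empty list means the record passes."""
    attrs = resolved_attributes(schema, class_name)
    problems = [f"missing required field: {name}"
                for name, spec in attrs.items()
                if spec.get("required") and name not in record]
    problems += [f"unknown field: {key}" for key in record if key not in attrs]
    return problems

print(validate(schema, "Sample", {"id": "S1", "collection_date": "2025-01-01"}))
print(validate(schema, "Sample", {"name": "leaf", "depth": 3}))
```

The first record passes; the second is flagged for two missing required fields and one unknown field. In practice the project's own tooling, documented at linkml.io, would be used instead of a custom validator like this one.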

Citations: 0
Challenges in structural variant calling in low-complexity regions.
IF 11.8, Zone 2 (Biology), Q1 MULTIDISCIPLINARY SCIENCES, Pub Date: 2025-12-12, DOI: 10.1093/gigascience/giaf154
Qian Qin, Heng Li

Background: Structural variants (SVs) are genomic differences ≥50 bp in length. They remain challenging to detect even with long sequence reads, and the sources of these difficulties are not well quantified.

Results: We identified 35.4 Mb of low-complexity regions (LCRs) in GRCh38. Although these regions cover only 1.2% of the genome, they contain 69.1% of confident SVs in sample HG002. Across long-read SV callers, 77.3-91.3% of erroneous SV calls occur within LCRs, with error rates increasing with LCR length.

Conclusion: SVs are enriched and difficult to call in LCRs. Special care needs to be taken for calling and analyzing these variants.
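
The enrichment claim can be made concrete with a little arithmetic on the two percentages given in the Results. The sketch below is a back-of-the-envelope calculation, not an analysis from the paper; it assumes SV density is simply the share of SVs divided by the share of the genome.

```python
def fold_enrichment(fraction_of_events, fraction_of_genome):
    """Event density in a region relative to the genome-wide average."""
    return fraction_of_events / fraction_of_genome

lcr_genome = 0.012  # LCRs cover 1.2% of the genome (from the abstract)
lcr_svs = 0.691     # LCRs contain 69.1% of confident SVs in HG002 (from the abstract)

# SV density inside LCRs relative to the genome-wide average.
vs_average = fold_enrichment(lcr_svs, lcr_genome)
# SV density inside LCRs relative to the rest of the genome.
vs_outside = vs_average / fold_enrichment(1 - lcr_svs, 1 - lcr_genome)

print(round(vs_average, 1), round(vs_outside, 1))
```

Under that assumption, confident SVs are roughly 58 times denser in LCRs than the genome-wide average, and the gap is wider still when LCRs are compared with the remainder of the genome.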

Citations: 0
Improved reference assembly and core collection re-sequencing to facilitate exploration of important agronomical traits for the improvement of oilseed crop, Carthamus tinctorius L.
IF 11.8, Zone 2 (Biology), Q1 MULTIDISCIPLINARY SCIENCES, Pub Date: 2025-12-11, DOI: 10.1093/gigascience/giaf151
Megha Sharma, Varun Bhardwaj, Praveen Kumar Oraon, Shivani Choudhary, Heena Ambreen, Rohit Nandan Shukla, Harsha Rayudu Jamedar, Ajitha Vijjeswarapu, Vandana Jaiswal, Palchamy Kadirvel, Arun Jagannath, Shailendra Goel

Background: Safflower (Carthamus tinctorius L.) is a drought-resilient oilseed crop. Besides producing edible oil rich in oleic and linoleic acid, it is also used in biofuels, cosmetics, colouring dyes, pharmaceuticals and nutraceuticals. Despite its significant economic uses, the availability of genetic and genomic resources in safflower is limited.

Results: We report an improved de novo genome assembly of safflower (Safflower_A2). A chromosome-level assembly of 1.15 Gb with telomeres and centromeric repeats was constructed using PacBio HiFi reads, optical maps, Illumina short reads, and Hi-C sequencing. Safflower_A2 shows better contiguity, completeness, and annotation quality than previous assemblies. The assembly was further validated with the help of a single nucleotide polymorphism (SNP)-based linkage map. A genome-wide survey identified genes for comprehensive exploration of disease resistance in safflower. Using the de novo genome assembly as a reference, we analysed resequencing data from a global core collection of 123 accessions in a SNP-based genome-wide association study, which identified significant associations and agronomically valuable haplotypes for several traits, including seed oil content. Resequencing data were also used for a pan-genome analysis, which provided critical insights into genome diversity and identified an additional ∼11,000 genes, together with their functional enrichment, that will be useful for region-specific breeding lines.

Conclusion: Our study provides insights into the genomic architecture of safflower by leveraging an improved genome assembly and annotation. Additionally, the high-density linkage map, marker-trait associations, and pan-genome developed in this study are valuable resources for breeding and crop improvement programs in the global research community.

Citations: 0
An Integrative Multi-Omics Random Forest Framework for Robust Biomarker Discovery.
IF 11.8, Zone 2 (Biology), Q1 MULTIDISCIPLINARY SCIENCES, Pub Date: 2025-12-09, DOI: 10.1093/gigascience/giaf148
Wei Zhang, Hanchen Huang, Lily Wang, Brian D Lehmann, X Steven Chen

High-throughput technologies now produce a wide array of omics data, from genomic and transcriptomic profiles to epigenomic and proteomic measurements. Integrating multiple omics layers measured on the same samples can reveal cross-layer molecular hubs that single-layer analyses miss. We present an unsupervised, multivariate random forest (MRF) framework with an inverse minimal depth (IMD) importance to prioritize shared biomarkers across omics. In each forest, one layer serves as a multivariate response and the other as predictors; IMD summarizes how early a predictor (or response MSRV) appears across trees, yielding interpretable, cross-layer feature rankings. We provide three IMD-based selection strategies and introduce an optional IMD power transform to enhance sensitivity to interaction signals. In extensive simulations spanning linear, nonlinear, and interaction regimes, our method matches SPLS/CCA under linear settings and outperforms them as nonlinearity increases, while adapted univariate ensemble learners (RF, GBM, XGBoost) underperform in the multivariate, unsupervised context. Applied to TCGA BRCA and COAD, MRF-IMD identifies genes, CpGs, and miRNAs enriched for cancer-relevant pathways and yields more robust survival stratification than linear integrators with matched model sizes. In a TCGA pan-cancer analysis, MRF-IMD features achieve higher ARI than alternatives and recover coherent tumor-type clusters; in ADNI, the integrative signature improves dementia-progression stratification over a published methylation risk score. Our scalable, interpretable MRF-IMD framework advances reliable multi-omics biomarker discovery when nonlinear, cross-layer dependencies matter.
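
The pan-cancer comparison above is scored with ARI, which we take to mean the adjusted Rand index, a chance-corrected measure of agreement between two partitions of the same samples. The sketch below implements the standard pair-counting formula from scratch; the labels are hypothetical and the code is not taken from the MRF-IMD software.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same samples."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    n = len(labels_true)
    if n < 2:
        raise ValueError("need at least two samples")
    # Pairs of samples grouped together in both partitions, in each alone.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = together_true * together_pred / comb(n, 2)
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are all singletons
    return (together_both - expected) / (maximum - expected)

# Hypothetical tumor-type labels vs. cluster assignments.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same grouping, renamed: 1.0
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # worse than chance: -0.5
```

A value of 1 means the clusters reproduce the reference grouping exactly, values near 0 are what random assignment would give, and negative values indicate agreement below chance.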

Citations: 0
A sulfatide-centered ultra-high resolution magnetic resonance MALDI imaging benchmark dataset for MS1-based lipid annotation tools.
IF 11.8, Zone 2 (Biology), Q1 MULTIDISCIPLINARY SCIENCES, Pub Date: 2025-12-09, DOI: 10.1093/gigascience/giaf150
Lars Gruber, Stefan Schmidt, Thomas Enzlein, Carsten Hopf

Spatial 'omics techniques are indispensable for studying complex biological systems and for the discovery of spatial biomarkers. While several current matrix-assisted laser desorption/ionization (MALDI) mass spectrometry imaging (MSI) instruments are capable of localizing numerous metabolites at high spatial and spectral resolution, the majority of MSI data is acquired at the MS1 level only. Assigning molecular identities based on MS1 data presents significant analytical and computational challenges, as the inherent limitations of MS1 data preclude confident annotations beyond the sum formula level. To enable future advancements of computational lipid annotation tools, well-characterized benchmark (or ground truth) datasets that go beyond synthetic data or data derived from mimetic tissue models are crucial. To this end, we provide two sulfatide-centered, biology-driven magnetic resonance MSI (MR-MSI) datasets at different mass resolving powers that characterize lipids in a mouse model of human metachromatic dystrophy. These data include an ultra-high-resolution (R ∼1,230,000) quantum cascade laser mid-infrared imaging-guided MR-MSI dataset that enables isotopic fine structure analysis and therefore substantially enhances the level of confidence. To highlight the usefulness of the data, we compared 118 manual sulfatide annotations with the number of decoy database-controlled sulfatide annotations performed in Metaspace (67 at FDR < 10%). Overall, our datasets can be used to benchmark annotation algorithms, validate spatial biomarker discovery pipelines, and serve as a reference for future studies that explore sulfatide metabolism and its spatial regulation.

Citations: 0
Translating short-form Python exercises to other programming languages using diverse prompting strategies.
IF 11.8 Zone 2 (Biology) Q1 MULTIDISCIPLINARY SCIENCES Pub Date: 2025-12-08 DOI: 10.1093/gigascience/giaf149
Stephen R Piccolo, Harlan P Stevens

With the increasing complexity and quantity of experimental and observational data, life scientists rely on programming to automate analyses, enhance reproducibility, and facilitate collaboration. Scripting languages like Python are often favored for their simplicity and flexibility, enabling researchers to focus primarily on high-level tasks. Compiled languages such as C++ and Rust offer greater efficiency, making them preferable for intensive or repeated computations. In educational settings, instructors may wish to teach both types of languages and thus may wish to translate content from one programming language to another. In research contexts, researchers may wish to implement their ideas in one language before translating the code to another. However, translating between programming languages requires significant effort, prompting our interest in using large language models (LLMs) for semi-automated code translation. This study explores the use of an LLM (GPT-4) to translate 559 short-form programming exercises from Python into C++, Rust, Julia, and JavaScript. We used three prompting strategies (instructions only, code only, or both combined) and compared the translated code's output against the Python code's output. Translation success differed considerably by prompting strategy, and at least one of the strategies tested was effective for nearly every exercise. The highest overall success rate occurred for Rust (99.5%), followed by JavaScript (98.9%), C++ (97.9%), and Julia (95.0%). Our findings demonstrate that LLMs can effectively translate small-scale programming exercises between languages, reducing the need for manual rewriting. To support education and research, we have manually translated all exercises that were not translated successfully through automation, and we have made our translations freely available.
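
The evaluation idea described in the abstract, checking whether a translated program prints the same output as the original Python program, can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function names `run_program`, `outputs_match`, and `success_rate` are hypothetical, the whitespace normalization is an assumption, and counting an exercise as successful when at least one strategy passes is one plausible reading of the "overall" success rate.

```python
import subprocess
import sys


def run_program(cmd, stdin_text=""):
    """Run a command and return its standard output as text."""
    result = subprocess.run(
        cmd, input=stdin_text, capture_output=True, text=True, timeout=30
    )
    return result.stdout


def normalize(text):
    """Ignore trailing whitespace on each line and leading/trailing blank lines."""
    return [line.rstrip() for line in text.strip().splitlines()]


def outputs_match(reference_cmd, candidate_cmd, stdin_text=""):
    """True when the candidate prints the same output as the reference."""
    reference = normalize(run_program(reference_cmd, stdin_text))
    candidate = normalize(run_program(candidate_cmd, stdin_text))
    return reference == candidate


def success_rate(results):
    """Percentage of exercises where at least one strategy succeeded.

    `results` maps an exercise id to {strategy name: passed?}.
    """
    if not results:
        return 0.0
    passed = sum(1 for by_strategy in results.values() if any(by_strategy.values()))
    return 100.0 * passed / len(results)


if __name__ == "__main__":
    # Both commands are Python here only to keep the demo self-contained;
    # in practice the candidate would be a compiled or interpreted translation.
    reference = [sys.executable, "-c", "print(sum(range(5)))"]
    candidate = [sys.executable, "-c", "print(10)"]
    print(outputs_match(reference, candidate))
```

Comparing printed output treats each program as a black box, so the same check works for any target language as long as a command to run the translated program is available.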

Citations: 0
Multi-Omics and High-Spatial-Resolution Omics: Deciphering Complexity in Neurological Disorders.
IF 11.8 Zone 2 (Biology) Q1 MULTIDISCIPLINARY SCIENCES Pub Date: 2025-12-05 DOI: 10.1093/gigascience/giaf137
Xiuyun Liu, Fangfang Li, Marek Czosnyka, Zofia Czosnyka, Huijie Yu, Xiaoguang Tong, Yan Xing, Hongliang Li, Ke Pu, Keke Feng, Kuo Zhang, Meijun Pang, Dong Ming

Background: The world has witnessed a steady rise in neurological diseases, which represent a heterogeneous group of disorders characterized by complex pathogenesis involving disruptions at multiple molecular levels, including genomic, transcriptomic, proteomic, and metabolomic levels. These disorders, often caused by genetic mutations, metabolic imbalances, immune dysregulation, and environmental factors, pose significant challenges to global public health due to their high prevalence, mortality, and disability burden.

Results: High-throughput technologies such as next-generation sequencing and mass spectrometry have provided valuable insights into the underlying mechanisms of disease. In particular, the development of multi-omics and high-spatial-resolution omics technologies makes it possible to integrate multiple levels of biology and to analyse complex molecular networks and pathophysiological processes.

Conclusions: This review provides a comprehensive analysis of the latest advancements in multi-omics and high-spatial-resolution omics, with a focus on their applications in precision diagnostics, biomarker discovery, and therapeutic target identification in brain diseases. The review also highlights current challenges in clinical implementation and discusses future directions, with artificial intelligence anticipated to significantly enhance clinical translation and diagnostic accuracy.

Citations: 0
Endometrial Whole Slide Images Dataset for Detection of malignancy in endometrial biopsies.
IF 11.8 Zone 2 (Biology) Q1 MULTIDISCIPLINARY SCIENCES Pub Date: 2025-12-05 DOI: 10.1093/gigascience/giaf147
Mahnaz Mohammadi, Christina Fell, Sarah Bell, Gareth Bryson, Sheeba Syed, Prakash Konanahalli, David Harris-Birtill, Ognjen Arandjelovic, Clare Orange, Prishma Shahi, In Hwa Um, James D Blackwood, David J Harrison

Background: Whole slide imaging (WSI) enables the digitisation of entire histological slides at high resolution, allowing pathologists and researchers to analyse tissue samples digitally rather than through traditional microscopy. This technology has become increasingly valuable in pathology for research, education, and clinical diagnostics. Endometrial biopsy is very common, often being undertaken to exclude non-cancerous disease. This means that most cases do not contain cancer, and the challenge is to accurately and efficiently exclude serious pathology rather than simply make a diagnosis of malignancy. A well-curated, expert-annotated, endometrial whole slide dataset covering a spread of cancer and non-cancer diagnoses will support machine learning applications in automated diagnosis, facilitate research into the pathology of endometrial cancer, and serve as an educational resource for medical professionals.

Results: We introduce a newly constructed, large-scale dataset of endometrial biopsies, comprising 2,909 whole slide images in iSyntax format, each accompanied by a corresponding annotation file in JSON format. Each whole slide image is labelled with a primary class label representing its final diagnosis and a sub-category label providing further details within that diagnostic class. These class labels are critical for machine learning applications, as they enable the development of AI models capable of distinguishing between different types of endometrial abnormalities, improving automated classification, and guiding clinical decision-making.

Conclusions: Constructing and curating a high-quality endometrial whole slide dataset requires significant effort to ensure accurate annotations, data integrity, and patient privacy protection. However, the availability of a well-annotated dataset with detailed class labels is crucial for advancing digital pathology. Such a resource can enhance diagnostic accuracy, support personalized treatment strategies, and ultimately improve outcomes for patients with endometrial cancer and other endometrial conditions.

Citations: 0