Pub Date: 2025-12-19 DOI: 10.1093/gigascience/giaf157
Jorge Mas-Gómez, Manuel Rubio, Federico Dicenta, Pedro José Martínez-García
Background: High-throughput phenotyping is addressing the current bottleneck in phenotyping within breeding programs. Imaging tools are becoming the primary resource for improving the efficiency of phenotyping processes and providing large datasets for genomic selection approaches. The advent of AI brings new advantages by enhancing phenotyping methods using imaging, making them more accessible to breeding programs. In this context, we have developed an open Python workflow for analyzing morphology, colour and morphometric traits using AI, which can be applied to fruits and other plant organs.
Results: The workflow was implemented in almond (Prunus dulcis (Mill.) D. A. Webb), a species where breeding efficiency is critical due to its long breeding cycle. Over 25,000 kernels, more than 20,000 nuts, and over 600 individuals were phenotyped, making this the largest morphological study conducted in almond so far. The best segmentation and reconstruction approaches achieved error rates below 1%. Weight and area variables enabled accurate estimation of kernel thickness, with a root mean squared error (RMSE) of 0.47. Fifty-five heritable morphological, morphometric and colour traits were identified, highlighting their potential as target traits in breeding programs.
Conclusion: The proposed workflow demonstrated robust performance across diverse datasets and was effective with limited training data for fine-tuning. Its compatibility with the output of AI-based labelling tools allows users to fully leverage the advantages of these technologies: reducing manual effort, accelerating dataset preparation, and streamlining the fine-tuning of segmentation models. This flexibility enhances the scalability and practical applicability of the workflow in real-world phenotyping scenarios, especially in the context of breeding programs.
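The root mean squared error quoted in the Results can be illustrated with a minimal sketch. The thickness values below are hypothetical placeholders rather than data from the study, and the abstract does not state the unit of the reported RMSE of 0.47.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical measured vs. estimated kernel thickness (illustrative only).
observed = [6.0, 7.0, 8.0, 9.0]
predicted = [6.5, 6.5, 8.5, 8.5]

print(rmse(observed, predicted))  # prints 0.5
```

Because the errors are squared before averaging, a few large misestimates raise the RMSE more than many small ones.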
{"title":"Open RGB Imaging Workflow for Morphological and Morphometric Analysis of Fruits using Deep Learning: A Case Study on Almonds.","authors":"Jorge Mas-Gómez, Manuel Rubio, Federico Dicenta, Pedro José Martínez-García","doi":"10.1093/gigascience/giaf157","DOIUrl":"https://doi.org/10.1093/gigascience/giaf157","url":null,"abstract":"<p><strong>Background: </strong>High-throughput phenotyping is addressing the current bottleneck in phenotyping within breeding programs. Imaging tools are becoming the primary resource for improving the efficiency of phenotyping processes and providing large datasets for genomic selection approaches. The advent of AI brings new advantages by enhancing phenotyping methods using imaging, making them more accessible to breeding programs. In this context, we have developed an open Python workflow for analyzing morphology, colour and morphometric traits using AI, which can be applied to fruits and other plant organs.</p><p><strong>Results: </strong>The workflow was implemented in almond (Prunus dulcis (Mill.) D. A. Webb), a species where breeding efficiency is critical due to its long breeding cycle. Over 25,000 kernels, more than 20,000 nuts, and over 600 individuals were phenotyped, making this the largest morphological study conducted in almond so far. The best segmentation and reconstruction approaches achieved error rates below 1%. Weight and area variables enabled accurate estimation of kernel thickness, with a root mean squared error (RMSE) of 0.47. Fifty-five heritable morphological, morphometric and colour traits were identified, highlighting their potential as target traits in breeding programs.</p><p><strong>Conclusion: </strong>The proposed workflow demonstrated robust performance across diverse datasets and being effective with limited training data for fine-tuning. 
Its compatibility with the output of AI-based labelling tools allows users to fully leverage the advantages of these technologies-reducing manual effort, accelerating dataset preparation, and streamlining the fine-tuning process of segmentation models. This flexibility enhances the scalability and practical applicability of the workflow in real-world phenotyping scenarios, especially in the context of breeding programs.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":" ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145793888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-18 DOI: 10.1093/gigascience/giaf158
Hangwei Liu, Lihong Lei, Fan Jiang, Bo Zhang, Hengchao Wang, Yutong Zhang, Hanbo Zhao, Guirong Wang, Wei Fan
Background: Praying mantises, members of the order Mantodea, play important roles in agriculture, medicine, bionics, and entertainment. However, the scarcity of genomic resources has hindered extensive studies on mantis evolution and behaviour.
Results: Here, we present the chromosome-scale reference genomes of five mantis species: the European mantis (Mantis religiosa), Chinese mantis (Tenodera sinensis), triangle dead leaf mantis (Deroplatys truncata), orchid mantis (Hymenopus coronatus), and metallic mantis (Metallyticus violacea). The assembled genome sizes range from ∼2.3 to 4.2 Gb, with contig N50 sizes of 1-109 Mb and 85-99% of the sequence anchored to chromosomes. The number of annotated protein-coding genes ranges from 17,804 to 19,017, with BUSCO completeness of 96.7-98.4%. We found that transposable element expansion is the major force governing genome size in Mantodea, and we suggest that translocations between the X chromosome and an autosome have occurred in the lineage of the family Mantidae. In addition, we found that the lineage of M. violacea has accumulated fewer substitutions than the lineages of other mantises. Furthermore, our genome-wide analyses showed that D. truncata is sister to H. coronatus rather than to M. religiosa and T. sinensis, which helps resolve the phylogenetic controversy surrounding the genus Deroplatys.
Conclusions: The high-quality genome assemblies of the five mantises provide a valuable resource for evolution studies of Mantodea and genetic improvement and breeding of beneficial biological control agents.
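For readers unfamiliar with the contig N50 statistic reported above, the following minimal sketch shows the standard definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The contig lengths are hypothetical toy values, not taken from the mantis assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy contig lengths in arbitrary units (illustrative only).
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))  # prints 70
```

In the toy example the total is 290, and the two longest contigs (80 + 70 = 150) already cover half of it, so the N50 is 70.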
{"title":"The genomes of five mantises provide insights into sex chromosome evolution and Mantodea phylogeny clarification.","authors":"Hangwei Liu, Lihong Lei, Fan Jiang, Bo Zhang, Hengchao Wang, Yutong Zhang, Hanbo Zhao, Guirong Wang, Wei Fan","doi":"10.1093/gigascience/giaf158","DOIUrl":"https://doi.org/10.1093/gigascience/giaf158","url":null,"abstract":"<p><strong>Background: </strong>Praying mantises, members of the order Mantodea, play important roles in agriculture, medicine, bionics, and entertainment. However, the scarcity of genomic resources has hindered extensive studies on mantis evolution and behaviour.</p><p><strong>Results: </strong>Here, we present the chromosome-scale reference genomes of five mantis species: the European mantis (Mantis religiosa), Chinese mantis (Tenodera sinensis), triangle dead leaf mantis (Deroplatys truncata), orchid mantis (Hymenopus coronatus), and metallic mantis (Metallyticus violacea). The assembled genome sizes range ∼2.3-4.2 Gb, with contig N50 size 1-109 Mb and 85-99% of sequence anchored to chromosomes. The annotated protein-coding gene number ranges 17,804-19,017, with BUSCO complete rate 96.7-98.4%. We found that transposable element expansion is the major force governing genome size in Mantodea, and suggest that translocations between the X chromosome and an autosome have occurred in the lineage of the family Mantidae. In addition, we found the lineage of M. violacea has accumulated fewer substitutions than the lineages of other mantises. Furthermore, our genome-wide analyses showed that D. truncata is sister to H. coronatus than M. religiosa and T. 
sinensis, helps resolve the phylogenic controversies of Deroplatys genus.</p><p><strong>Conclusions: </strong>The high-quality genome assemblies of the five mantises provide a valuable resource for evolution studies of Mantodea and genetic improvement and breeding of beneficial biological control agents.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":" ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145774156","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-12 DOI: 10.1093/gigascience/giaf152
Sierra A T Moxon, Harold Solbrig, Nomi L Harris, Patrick Kalita, Mark A Miller, Sujay Patil, Kevin Schaper, Chris Bizon, J Harry Caufield, Silvano Cirujano Cuesta, Corey Cox, Frank Dekervel, Damion M Dooley, William D Duncan, Tim Fliss, Sarah Gehrke, Adam S L Graefe, Harshad Hegde, A J Ireland, Julius O B Jacobsen, Madan Krishnamurthy, Carlo Kroll, David Linke, Ryan Ly, Nicolas Matentzoglu, James A Overton, Jonny L Saunders, Deepak R Unni, Gaurav Vaidya, Wouter-Michiel A M Vierdag, Oliver Ruebel, Christopher G Chute, Matthew H Brush, Melissa A Haendel, Christopher J Mungall
Background: Scientific research relies on well-structured, standardized data; however, much of it is stored in formats such as free-text lab notebooks, non-standardized spreadsheets, or data repositories. This lack of structure challenges interoperability, making data integration, validation, and reuse difficult.
Findings: LinkML (Linked Data Modeling Language) is an open framework that simplifies the process of authoring, validating, and sharing data. LinkML can describe a range of data structures, from flat, list-based models to complex, interrelated, and normalized models that utilize polymorphism and compound inheritance. It offers an approachable syntax that is not tied to any one technical architecture and can be integrated seamlessly with many existing frameworks. The LinkML syntax provides a standard way to describe schemas, classes, and relationships, allowing modelers to build well-defined, stable, and optionally ontology-aligned data structures. Once defined, LinkML schemas may be imported into other LinkML schemas. These key features make LinkML an accessible platform for interdisciplinary collaboration and a reliable way to define and share data semantics.
Conclusions: LinkML helps reduce heterogeneity, complexity, and the proliferation of single-use data models while simultaneously enabling compliance with FAIR data standards. LinkML has seen increasing adoption in various fields, including biology, chemistry, biomedicine, microbiome research, finance, electrical engineering, transportation, and commercial software development. In short, LinkML makes implicit models explicitly computable and allows data to be standardized at its origin. LinkML documentation and code are available at linkml.io.
{"title":"LinkML: An Open Data Modeling Framework.","authors":"Sierra A T Moxon, Harold Solbrig, Nomi L Harris, Patrick Kalita, Mark A Miller, Sujay Patil, Kevin Schaper, Chris Bizon, J Harry Caufield, Silvano Cirujano Cuesta, Corey Cox, Frank Dekervel, Damion M Dooley, William D Duncan, Tim Fliss, Sarah Gehrke, Adam S L Graefe, Harshad Hegde, A J Ireland, Julius O B Jacobsen, Madan Krishnamurthy, Carlo Kroll, David Linke, Ryan Ly, Nicolas Matentzoglu, James A Overton, Jonny L Saunders, Deepak R Unni, Gaurav Vaidya, Wouter-Michiel A M Vierdag, Oliver Ruebel, Christopher G Chute, Matthew H Brush, Melissa A Haendel, Christopher J Mungall","doi":"10.1093/gigascience/giaf152","DOIUrl":"https://doi.org/10.1093/gigascience/giaf152","url":null,"abstract":"<p><strong>Background: </strong>Scientific research relies on well-structured, standardized data; however, much of it is stored in formats such as free-text lab notebooks, non-standardized spreadsheets, or data repositories. This lack of structure challenges interoperability, making data integration, validation, and reuse difficult.</p><p><strong>Findings: </strong>LinkML (Linked Data Modeling Language) is an open framework that simplifies the process of authoring, validating, and sharing data. LinkML can describe a range of data structures, from flat, list-based models to complex, interrelated, and normalized models that utilize polymorphism and compound inheritance. It offers an approachable syntax that is not tied to any one technical architecture and can be integrated seamlessly with many existing frameworks. The LinkML syntax provides a standard way to describe schemas, classes, and relationships, allowing modelers to build well-defined, stable, and optionally ontology-aligned data structures. Once defined, LinkML schemas may be imported into other LinkML schemas. 
These key features make LinkML an accessible platform for interdisciplinary collaboration and a reliable way to define and share data semantics.</p><p><strong>Conclusions: </strong>LinkML helps reduce heterogeneity, complexity, and the proliferation of single-use data models while simultaneously enabling compliance with FAIR data standards. LinkML has seen increasing adoption in various fields, including biology, chemistry, biomedicine, microbiome research, finance, electrical engineering, transportation, and commercial software development. In short, LinkML makes implicit models explicitly computable and allows data to be standardized at its origin. LinkML documentation and code are available at linkml.io.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":" ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145742108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-12 DOI: 10.1093/gigascience/giaf154
Qian Qin, Heng Li
Background: Structural variants (SVs) are genomic differences ≥50 bp in length. They remain challenging to detect even with long sequence reads, and the sources of these difficulties are not well quantified.
Results: We identified 35.4 Mb of low-complexity regions (LCRs) in GRCh38. Although these regions cover only 1.2% of the genome, they contain 69.1% of confident SVs in sample HG002. Across long-read SV callers, 77.3-91.3% of erroneous SV calls occur within LCRs, with error rates increasing with LCR length.
Conclusion: SVs are enriched and difficult to call in LCRs. Special care needs to be taken for calling and analyzing these variants.
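The degree of enrichment implied by the Results can be worked out from the reported percentages alone. The sketch below is back-of-the-envelope arithmetic on the abstract's figures, not a reproduction of the authors' analysis; the function names are illustrative.

```python
# Figures quoted in the abstract.
lcr_size_mb = 35.4            # total LCR length in GRCh38
lcr_genome_fraction = 0.012   # LCRs cover 1.2% of the genome
sv_fraction_in_lcr = 0.691    # 69.1% of confident HG002 SVs fall in LCRs


def fold_enrichment(sv_fraction, genome_fraction):
    """SV share inside a region divided by the region's share of the genome."""
    return sv_fraction / genome_fraction


def density_ratio(sv_fraction, genome_fraction):
    """SV density inside the region relative to SV density outside it."""
    inside = sv_fraction / genome_fraction
    outside = (1 - sv_fraction) / (1 - genome_fraction)
    return inside / outside


print(round(fold_enrichment(sv_fraction_in_lcr, lcr_genome_fraction), 1))
print(round(density_ratio(sv_fraction_in_lcr, lcr_genome_fraction), 1))
# Genome size implied by the two coverage figures, in Mb.
print(round(lcr_size_mb / lcr_genome_fraction))
```

On these numbers, LCRs hold roughly 58 times more SVs than their share of the genome would suggest, and the per-base SV density inside LCRs is roughly 184 times that of the rest of the genome. The implied genome size of about 2,950 Mb is only approximate because the 1.2% figure is rounded.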
{"title":"Challenges in structural variant calling in low-complexity regions.","authors":"Qian Qin, Heng Li","doi":"10.1093/gigascience/giaf154","DOIUrl":"https://doi.org/10.1093/gigascience/giaf154","url":null,"abstract":"<p><strong>Background: </strong>Structural variants (SVs) are genomic differences ≥50 bp in length. They remain challenging to detect even with long sequence reads, and the sources of these difficulties are not well quantified.</p><p><strong>Results: </strong>We identified 35.4 Mb of low-complexity regions (LCRs) in GRCh38. Although these regions cover only 1.2% of the genome, they contain 69.1% of confident SVs in sample HG002. Across long-read SV callers, 77.3-91.3% of erroneous SV calls occur within LCRs, with error rates increasing with LCR length.</p><p><strong>Conclusion: </strong>SVs are enriched and difficult to call in LCRs. Special care needs to be taken for calling and analyzing these variants.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":" ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145742150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Background: Safflower (Carthamus tinctorius L.) is a drought-resilient oilseed crop. Besides producing edible oil rich in oleic and linoleic acid, it is also used in biofuels, cosmetics, colouring dyes, pharmaceuticals and nutraceuticals. Despite its significant economic uses, the availability of genetic and genomic resources in safflower is limited.
Results: We report an improved de novo genome assembly of safflower (Safflower_A2). A chromosome-level assembly of 1.15 Gb with telomeres and centromeric repeats was constructed using PacBio HiFi reads, optical maps, Illumina short reads, and Hi-C sequencing. Safflower_A2 shows better contiguity, completeness, and annotation quality than previous assemblies. The assembly was further validated with a single nucleotide polymorphism (SNP)-based linkage map. A genome-wide survey identified genes for comprehensive exploration of disease resistance in safflower. Using the de novo genome assembly as a reference, we used resequencing data from a global core collection of 123 accessions to carry out a SNP-based genome-wide association study, which identified significant associations and haplotypes of agronomic value for several traits, including seed oil content. The resequencing data were also used for a pan-genome analysis, which provided critical insights into genome diversity and identified an additional ∼11,000 genes, together with their functional enrichment, that will be useful for developing region-specific breeding lines.
Conclusion: Our study provides insights into the genomic architecture of safflower by leveraging an improved genome assembly and annotation. Additionally, the high-density linkage map, marker-trait associations, and pan-genome developed in this study are valuable resources for breeding and crop improvement programs in the global research community.
{"title":"Improved reference assembly and core collection re-sequencing to facilitate exploration of important agronomical traits for the improvement of oilseed crop, Carthamus tinctorius L.","authors":"Megha Sharma, Varun Bhardwaj, Praveen Kumar Oraon, Shivani Choudhary, Heena Ambreen, Rohit Nandan Shukla, Harsha Rayudu Jamedar, Ajitha Vijjeswarapu, Vandana Jaiswal, Palchamy Kadirvel, Arun Jagannath, Shailendra Goel","doi":"10.1093/gigascience/giaf151","DOIUrl":"https://doi.org/10.1093/gigascience/giaf151","url":null,"abstract":"<p><strong>Background: </strong>Safflower (Carthamus tinctorius L.) is a drought-resilient oilseed crop. Besides producing edible oil rich in oleic and linoleic acid, it is also used in biofuels, cosmetics, colouring dyes, pharmaceuticals and nutraceuticals. Despite its significant economic uses, availability of genetic and genomic resources in safflower are limited.</p><p><strong>Results: </strong>We report an improved de novo genome assembly of safflower (Safflower_A2). A chromosome-level assembly of 1.15 Gb with telomeres and centromeric repeats, was constructed using PacBio HiFi reads, optical maps, Illumina short reads, and Hi-C sequencing. Safflower_A2 shows better contiguity, completeness, and high-quality annotation than previous assemblies. The assembly was further validated with the help of a single nucleotide polymorphism (SNP)-based linkage map. A genome-wide survey identified genes for comprehensive exploration of disease resistance in the safflower. Employing the de novo genome assembly as a reference, we used resequencing data of a global core-collection of 123 accessions to carry out a SNP-based genome-wide association study, which identified significant associations for several traits, their haplotypes of agronomic value, including seed oil content. 
Resequencing data was also applied for a pan-genome analysis which provided critical insights into genome diversity identifying an additional ∼11000 genes and their functional enrichment that will be useful for region-specific breeding lines.</p><p><strong>Conclusion: </strong>Our study provides insights into the genomic architecture of safflower by leveraging an improved genome assembly and annotation. Additionally, resources including high-density linkage map, marker-trait associations, and pan-genome developed in this study provide valuable resources for use in breeding and crop improvement programs by the global research community.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":" ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145722306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-09 DOI: 10.1093/gigascience/giaf148
Wei Zhang, Hanchen Huang, Lily Wang, Brian D Lehmann, X Steven Chen
High-throughput technologies now produce a wide array of omics data, from genomic and transcriptomic profiles to epigenomic and proteomic measurements. Integrating multiple omics layers measured on the same samples can reveal cross-layer molecular hubs that single-layer analyses miss. We present an unsupervised, multivariate random forest (MRF) framework with an inverse minimal depth (IMD) importance to prioritize shared biomarkers across omics. In each forest, one layer serves as a multivariate response and the other as predictors; IMD summarizes how early a predictor (or response MSRV) appears across trees, yielding interpretable, cross-layer feature rankings. We provide three IMD-based selection strategies and introduce an optional IMD power transform to enhance sensitivity to interaction signals. In extensive simulations spanning linear, nonlinear, and interaction regimes, our method matches SPLS/CCA under linear settings and outperforms them as nonlinearity increases, while adapted univariate ensemble learners (RF, GBM, XGBoost) underperform in the multivariate, unsupervised context. Applied to TCGA BRCA and COAD, MRF-IMD identifies genes, CpGs, and miRNAs enriched for cancer-relevant pathways and yields more robust survival stratification than linear integrators with matched model sizes. In a TCGA pan-cancer analysis, MRF-IMD features achieve higher ARI than alternatives and recover coherent tumor-type clusters; in ADNI, the integrative signature improves dementia-progression stratification over a published methylation risk score. Our scalable, interpretable MRF-IMD framework advances reliable multi-omics biomarker discovery when nonlinear, cross-layer dependencies matter.
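The idea that IMD "summarizes how early a predictor appears across trees" can be made concrete with a toy sketch. The abstract does not give the IMD formula, so the score used here, the forest-average of 1 / (1 + minimal depth), is an assumption chosen purely for illustration and should not be read as the authors' definition. The tree encoding and feature names are likewise hypothetical.

```python
def minimal_depths(tree, depth=0, found=None):
    """Map each splitting feature to the shallowest depth at which it occurs.

    A tree is a nested dict {"feature": name, "left": subtree, "right": subtree};
    a missing or None subtree marks a leaf.
    """
    if found is None:
        found = {}
    if tree is None:
        return found
    feature = tree["feature"]
    if feature not in found or depth < found[feature]:
        found[feature] = depth
    minimal_depths(tree.get("left"), depth + 1, found)
    minimal_depths(tree.get("right"), depth + 1, found)
    return found


def inverse_minimal_depth(forest):
    """Assumed score: mean of 1 / (1 + minimal depth); absent features add 0."""
    scores = {}
    for tree in forest:
        for feature, depth in minimal_depths(tree).items():
            scores[feature] = scores.get(feature, 0.0) + 1.0 / (1 + depth)
    return {feature: total / len(forest) for feature, total in scores.items()}


# Two hypothetical trees with made-up feature names.
tree_a = {"feature": "geneA", "left": {"feature": "cpgB"}, "right": None}
tree_b = {"feature": "cpgB", "left": {"feature": "geneA"}, "right": {"feature": "mirC"}}

print(inverse_minimal_depth([tree_a, tree_b]))
```

Under this assumed score, features that split near the root in many trees rank highest: geneA and cpgB each score 0.75, while mirC, which appears once at depth 1, scores 0.25.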
{"title":"An Integrative Multi-Omics Random Forest Framework for Robust Biomarker Discovery.","authors":"Wei Zhang, Hanchen Huang, Lily Wang, Brian D Lehmann, X Steven Chen","doi":"10.1093/gigascience/giaf148","DOIUrl":"10.1093/gigascience/giaf148","url":null,"abstract":"<p><p>High-throughput technologies now produce a wide array of omics data, from genomic and transcriptomic profiles to epigenomic and proteomic measurements. Integrating multiple omics layers measured on the same samples can reveal cross-layer molecular hubs that single-layer analyses miss. We present an unsupervised, multivariate random forest (MRF) framework with an inverse minimal depth (IMD) importance to prioritize shared biomarkers across omics. In each forest, one layer serves as a multivariate response and the other as predictors; IMD summarizes how early a predictor (or response MSRV) appears across trees, yielding interpretable, cross-layer feature rankings. We provide three IMD-based selection strategies and introduce an optional IMD power transform to enhance sensitivity to interaction signals. In extensive simulations spanning linear, nonlinear, and interaction regimes, our method matches SPLS/CCA under linear settings and outperforms them as nonlinearity increases, while adapted univariate ensemble learners (RF, GBM, XGBoost) underperform in the multivariate, unsupervised context. Applied to TCGA BRCA and COAD, MRF-IMD identifies genes, CpGs, and miRNAs enriched for cancer-relevant pathways and yields more robust survival stratification than linear integrators with matched model sizes. In a TCGA pan-cancer analysis, MRF-IMD features achieve higher ARI than alternatives and recover coherent tumor-type clusters; in ADNI, the integrative signature improves dementia-progression stratification over a published methylation risk score. 
Our scalable, interpretable MRF-IMD framework advances reliable multi-omics biomarker discovery when nonlinear, cross-layer dependencies matter.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":" ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145707663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-09 DOI: 10.1093/gigascience/giaf150
Lars Gruber, Stefan Schmidt, Thomas Enzlein, Carsten Hopf
Spatial 'omics techniques are indispensable for studying complex biological systems and for the discovery of spatial biomarkers. While several current matrix-assisted laser desorption/ionization (MALDI) mass spectrometry imaging (MSI) instruments are capable of localizing numerous metabolites at high spatial and spectral resolution, the majority of MSI data is acquired at the MS1 level only. Assigning molecular identities based on MS1 data presents significant analytical and computational challenges, as the inherent limitations of MS1 data preclude confident annotations beyond the sum formula level. To enable future advancements of computational lipid annotation tools, well-characterized benchmark (or ground-truth) datasets that go beyond synthetic data or data derived from mimetic tissue models are crucial. To this end, we provide two sulfatide-centered, biology-driven magnetic resonance MSI (MR-MSI) datasets at different mass resolving powers that characterize lipids in a mouse model of human metachromatic dystrophy. These data include an ultra-high-resolution (R ∼1,230,000) quantum cascade laser mid-infrared imaging-guided MR-MSI dataset that enables isotopic fine structure analysis and therefore substantially enhances the level of confidence. To highlight the usefulness of the data, we compared 118 manual sulfatide annotations with the number of decoy database-controlled sulfatide annotations performed in Metaspace (67 at FDR < 10%). Overall, our datasets can be used to benchmark annotation algorithms, validate spatial biomarker discovery pipelines, and serve as a reference for future studies that explore sulfatide metabolism and its spatial regulation.
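To give a sense of scale for the quoted resolving power, the usual definition R = m / Δm can be inverted to obtain the smallest resolvable peak width at a given m/z. The abstract does not say at which m/z the value R ∼1,230,000 was determined, so the m/z of 800 used below is a hypothetical choice for illustration only.

```python
def resolvable_peak_width(mz, resolving_power):
    """Peak width (delta m) implied by R = m / delta m at the given m/z."""
    if resolving_power <= 0:
        raise ValueError("resolving power must be positive")
    return mz / resolving_power


# Resolving power from the abstract; the m/z value is an assumed example.
R = 1_230_000
example_mz = 800.0

delta_m = resolvable_peak_width(example_mz, R)
print(f"{delta_m * 1000:.2f} mDa")  # prints 0.65 mDa
```

Peak widths well below one millidalton are what make it possible to separate isotopologues that share a nominal mass, which is the basis of the isotopic fine structure analysis mentioned above.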
{"title":"A sulfatide-centered ultra-high resolution magnetic resonance MALDI imaging benchmark dataset for MS1-based lipid annotation tools.","authors":"Lars Gruber, Stefan Schmidt, Thomas Enzlein, Carsten Hopf","doi":"10.1093/gigascience/giaf150","DOIUrl":"https://doi.org/10.1093/gigascience/giaf150","url":null,"abstract":"<p><p>Spatial 'omics techniques are indispensable for studying complex biological systems and for the discovery of spatial biomarkers. While several current matrix-assisted laser desorption/ionization (MALDI) mass spectrometry imaging (MSI) instruments are capable of localizing numerous metabolites at high spatial and spectral resolution, the majority of MSI data is acquired at the MS1 level only. Assigning molecular identities based on MS1 data presents significant analytical and computational challenges, as the inherent limitations of MS1 data preclude confident annotations beyond the sum formula level. To enable future advancements of computational lipid annotation tools, well-characterized benchmark - or ground truth - datasets are crucial, which exceed the scope of synthetic data or data derived from mimetic tissue models. To this end, we provide two sulfatide-centered, biology-driven magnetic resonance MSI (MR-MSI) datasets at different mass resolving powers that characterize lipids in a mouse model of human metachromatic dystrophy. This data includes an ultra-high-resolution (R ∼1,230,000) quantum cascade laser mid-infrared imaging-guided MR-MSI dataset that enables isotopic fine structure analysis and therefore enhances the level of confidence substantially. To highlight the usefulness of the data, we compared 118 manual sulfatide annotations with the number of decoy database-controlled sulfatide annotations performed in Metaspace (67 at FDR < 10%). 
Overall, our datasets can be used to benchmark annotation algorithms, validate spatial biomarker discovery pipelines, and serve as a reference for future studies that explore sulfatide metabolism and its spatial regulation.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":" ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145707558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-08 DOI: 10.1093/gigascience/giaf149
Stephen R Piccolo, Harlan P Stevens
With the increasing complexity and quantity of experimental and observational data, life scientists rely on programming to automate analyses, enhance reproducibility, and facilitate collaboration. Scripting languages like Python are often favored for their simplicity and flexibility, enabling researchers to focus primarily on high-level tasks. Compiled languages such as C++ and Rust offer greater efficiency, making them preferable for intensive or repeated computations. In educational settings, instructors may wish to teach both types of languages and thus may wish to translate content from one programming language to another. In research contexts, researchers may wish to implement their ideas in one language before translating the code to another. However, translating between programming languages requires significant effort, prompting our interest in using large language models (LLMs) for semi-automated code translation. This study explores the use of an LLM (GPT-4) to translate 559 short-form programming exercises from Python into C++, Rust, Julia, and JavaScript. We used three prompting strategies (instructions only, code only, or both combined) and compared the translated code's output against the Python code's output. Translation success differed considerably by prompting strategy, and at least one of the strategies tested was effective for nearly every exercise. The highest overall success rate occurred for Rust (99.5%), followed by JavaScript (98.9%), C++ (97.9%), and Julia (95.0%). Our findings demonstrate that LLMs can effectively translate small-scale programming exercises between languages, reducing the need for manual rewriting. To support education and research, we have manually translated all exercises that were not translated successfully through automation, and we have made our translations freely available.
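The output-comparison step described above can be sketched in a few lines. The abstract does not describe the authors' harness, so everything below is an assumption: the helper names are hypothetical, the comparison ignores leading and trailing whitespace, and a second Python one-liner stands in for a compiled translation so that the example runs without other toolchains.

```python
import subprocess
import sys


def run_and_capture(command, stdin_text=""):
    """Run a command and return its standard output as text."""
    result = subprocess.run(
        command, input=stdin_text, capture_output=True, text=True, timeout=10
    )
    return result.stdout


def outputs_match(reference_cmd, candidate_cmd, stdin_text=""):
    """True if both commands print the same text, ignoring outer whitespace."""
    reference_out = run_and_capture(reference_cmd, stdin_text).strip()
    candidate_out = run_and_capture(candidate_cmd, stdin_text).strip()
    return reference_out == candidate_out


# The reference exercise and a stand-in for its translation (both hypothetical).
reference = [sys.executable, "-c", "print(sum(range(5)))"]
candidate = [sys.executable, "-c", "print(0 + 1 + 2 + 3 + 4)"]

print(outputs_match(reference, candidate))  # prints True
```

In practice the candidate command would be the compiled or interpreted translation (for example a Rust binary), and a stricter or looser comparison rule may be appropriate depending on how the exercises format their output.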
Title: Translating short-form Python exercises to other programming languages using diverse prompting strategies. Journal: GigaScience.
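The evaluation described in this abstract, comparing each translated program's output against the Python program's output, can be illustrated with a minimal sketch. This is not the authors' code: the function names, the whitespace-tolerant comparison, and the reading of "overall success" as "at least one prompting strategy produced matching output" are all assumptions made for illustration.

```python
import subprocess


def run_program(command, stdin_text=""):
    """Run a command (a list of arguments) and return its standard output."""
    result = subprocess.run(
        command,
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=30,
    )
    return result.stdout


def outputs_match(reference_output, candidate_output):
    """Compare two outputs line by line, ignoring trailing whitespace.

    Assumption: exact textual equality after whitespace normalisation is the
    success criterion; the study may have used a different rule.
    """
    ref_lines = [line.rstrip() for line in reference_output.strip().splitlines()]
    cand_lines = [line.rstrip() for line in candidate_output.strip().splitlines()]
    return ref_lines == cand_lines


def success_rate(results):
    """Percentage of exercises solved by at least one prompting strategy.

    `results` maps a (hypothetical) exercise id to {strategy_name: bool}.
    """
    if not results:
        return 0.0
    solved = sum(1 for by_strategy in results.values() if any(by_strategy.values()))
    return 100.0 * solved / len(results)
```

In use, `run_program` would be called once for the Python original and once for each compiled or interpreted translation, and the two outputs passed to `outputs_match`; the per-strategy booleans then feed `success_rate` for each target language.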
Pub Date : 2025-12-05 DOI: 10.1093/gigascience/giaf137
Xiuyun Liu, Fangfang Li, Marek Czosnyka, Zofia Czosnyka, Huijie Yu, Xiaoguang Tong, Yan Xing, Hongliang Li, Ke Pu, Keke Feng, Kuo Zhang, Meijun Pang, Dong Ming
Background: The world has witnessed a steady rise in neurological diseases, which represent a heterogeneous group of disorders characterized by complex pathogenesis involving disruptions at multiple molecular levels, including genomic, transcriptomic, proteomic, and metabolomic levels. These disorders, often caused by genetic mutations, metabolic imbalances, immune dysregulation, and environmental factors, pose significant challenges to global public health due to their high prevalence, mortality, and disability burden.
Results: High-throughput technologies such as next-generation sequencing and mass spectrometry have provided valuable insights into the underlying mechanisms of disease. In particular, multi-omics and high-spatial-resolution omics technologies make it possible to integrate multiple levels of biology and to analyze complex molecular networks and pathophysiological processes.
Conclusions: This review provides a comprehensive analysis of the latest advancements in multi- and high-spatial-resolution omics, with a focus on their applications in precision diagnostics, biomarker discovery, and therapeutic target identification in brain diseases. It also highlights current challenges in clinical implementation and discusses future directions, in which artificial intelligence is expected to significantly enhance clinical translation and diagnostic accuracy.
Title: Multi-Omics and High-Spatial-Resolution Omics: Deciphering Complexity in Neurological Disorders. Journal: GigaScience.
Pub Date : 2025-12-05 DOI: 10.1093/gigascience/giaf147
Mahnaz Mohammadi, Christina Fell, Sarah Bell, Gareth Bryson, Sheeba Syed, Prakash Konanahalli, David Harris-Birtill, Ognjen Arandjelovic, Clare Orange, Prishma Shahi, In Hwa Um, James D Blackwood, David J Harrison
Background: Whole slide imaging (WSI) enables the digitisation of entire histological slides at high resolution, allowing pathologists and researchers to analyse tissue samples digitally rather than through traditional microscopy. This technology has become increasingly valuable in pathology for research, education, and clinical diagnostics. Endometrial biopsy is very common, often being undertaken to exclude non-cancerous disease. This means that most cases do not contain cancer, and the challenge is to accurately and efficiently exclude serious pathology rather than simply make a diagnosis of malignancy. A well-curated, expert-annotated, endometrial whole slide dataset covering a spread of cancer and non-cancer diagnoses will support machine learning applications in automated diagnosis, facilitate research into the pathology of endometrial cancer, and serve as an educational resource for medical professionals.
Results: We introduce a newly constructed, large-scale dataset of endometrial biopsies, comprising 2,909 whole slide images in iSyntax format, each accompanied by a corresponding annotation file in JSON format. Each whole slide image is labelled with a primary class label representing its final diagnosis and a sub-category label providing further details within that diagnostic class. These class labels are critical for machine learning applications, as they enable the development of AI models capable of distinguishing between different types of endometrial abnormalities, improving automated classification, and guiding clinical decision-making.
Conclusions: Constructing and curating a high-quality endometrial whole slide dataset requires significant effort to ensure accurate annotations, data integrity, and patient privacy protection. However, the availability of a well-annotated dataset with detailed class labels is crucial for advancing digital pathology. Such a resource can enhance diagnostic accuracy, support personalized treatment strategies, and ultimately improve outcomes for patients with endometrial cancer and other endometrial conditions.
Title: Endometrial Whole Slide Images Dataset for Detection of malignancy in endometrial biopsies. Journal: GigaScience.
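The dataset layout described in this abstract (each iSyntax slide accompanied by a JSON annotation file carrying a primary class label and a sub-category label) suggests a simple first step before model training: pairing slides with annotations and tallying the label distribution. The sketch below is illustrative only. It assumes that a slide and its annotation share a file stem, that each annotation is a flat JSON object, and that the labels live under the hypothetical keys `primary_label` and `sub_category`; the real dataset's file naming and schema may differ and should be checked against its documentation.

```python
import json
from collections import Counter
from pathlib import Path


def pair_slides_with_annotations(dataset_dir):
    """Return (slide_path, annotation_path) pairs that share a file stem.

    Assumption: 'case_001.isyntax' is annotated by 'case_001.json'.
    """
    dataset_dir = Path(dataset_dir)
    pairs = []
    for slide_path in sorted(dataset_dir.glob("*.isyntax")):
        annotation_path = slide_path.with_suffix(".json")
        if annotation_path.exists():
            pairs.append((slide_path, annotation_path))
    return pairs


def count_labels(pairs, primary_key="primary_label", sub_key="sub_category"):
    """Tally (primary, sub-category) combinations across annotation files.

    The key names are hypothetical placeholders, not the dataset's schema.
    """
    counts = Counter()
    for _, annotation_path in pairs:
        with open(annotation_path, encoding="utf-8") as handle:
            annotation = json.load(handle)
        counts[(annotation.get(primary_key), annotation.get(sub_key))] += 1
    return counts
```

A tally like this makes class imbalance visible early, which matters here because the abstract notes that most endometrial biopsies do not contain cancer.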