Background: Safflower (Carthamus tinctorius L.) is a drought-resilient oilseed crop. Besides yielding edible oil rich in oleic and linoleic acids, it is used in biofuels, cosmetics, dyes, pharmaceuticals, and nutraceuticals. Despite its economic importance, genetic and genomic resources for safflower remain limited.
Results: We report an improved de novo genome assembly of safflower (Safflower_A2). A chromosome-level assembly of 1.15 Gb, including telomeres and centromeric repeats, was constructed using PacBio HiFi reads, optical maps, Illumina short reads, and Hi-C sequencing. Safflower_A2 shows better contiguity, completeness, and annotation quality than previous assemblies. The assembly was further validated against a single nucleotide polymorphism (SNP)-based linkage map. A genome-wide survey identified genes that enable a comprehensive exploration of disease resistance in safflower. Using the de novo assembly as a reference, we analysed resequencing data from a global core collection of 123 accessions in a SNP-based genome-wide association study, which identified significant associations, and haplotypes of agronomic value, for several traits including seed oil content. The resequencing data were also used for a pan-genome analysis, which provided insights into genome diversity and identified an additional ∼11,000 genes, together with their functional enrichment, that will be useful for developing region-specific breeding lines.
Conclusion: Our study provides insights into the genomic architecture of safflower by leveraging an improved genome assembly and annotation. In addition, the high-density linkage map, marker-trait associations, and pan-genome developed in this study are valuable resources for breeding and crop improvement programs in the global research community.
Title: Improved reference assembly and core collection re-sequencing to facilitate exploration of important agronomical traits for the improvement of oilseed crop, Carthamus tinctorius L.
Authors: Megha Sharma, Varun Bhardwaj, Praveen Kumar Oraon, Shivani Choudhary, Heena Ambreen, Rohit Nandan Shukla, Harsha Rayudu Jamedar, Ajitha Vijjeswarapu, Vandana Jaiswal, Palchamy Kadirvel, Arun Jagannath, Shailendra Goel
Journal: GigaScience. Pub Date: 2025-12-11. DOI: 10.1093/gigascience/giaf151
Pub Date: 2025-12-09. DOI: 10.1093/gigascience/giaf148
Wei Zhang, Hanchen Huang, Lily Wang, Brian D Lehmann, X Steven Chen
High-throughput technologies now produce a wide array of omics data, from genomic and transcriptomic profiles to epigenomic and proteomic measurements. Integrating multiple omics layers measured on the same samples can reveal cross-layer molecular hubs that single-layer analyses miss. We present an unsupervised, multivariate random forest (MRF) framework with an inverse minimal depth (IMD) importance to prioritize shared biomarkers across omics. In each forest, one layer serves as a multivariate response and the other as predictors; IMD summarizes how early a predictor (or response MSRV) appears across trees, yielding interpretable, cross-layer feature rankings. We provide three IMD-based selection strategies and introduce an optional IMD power transform to enhance sensitivity to interaction signals. In extensive simulations spanning linear, nonlinear, and interaction regimes, our method matches SPLS/CCA under linear settings and outperforms them as nonlinearity increases, while adapted univariate ensemble learners (RF, GBM, XGBoost) underperform in the multivariate, unsupervised context. Applied to TCGA BRCA and COAD, MRF-IMD identifies genes, CpGs, and miRNAs enriched for cancer-relevant pathways and yields more robust survival stratification than linear integrators with matched model sizes. In a TCGA pan-cancer analysis, MRF-IMD features achieve higher ARI than alternatives and recover coherent tumor-type clusters; in ADNI, the integrative signature improves dementia-progression stratification over a published methylation risk score. Our scalable, interpretable MRF-IMD framework advances reliable multi-omics biomarker discovery when nonlinear, cross-layer dependencies matter.
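The idea of ranking predictors by how early they appear in the trees can be illustrated with a small sketch. The abstract does not give the exact IMD formula, so the code below assumes one plausible reading: for each tree, take the shallowest depth d at which a feature is used for a split, score it as 1 / (1 + d), and average over the forest, counting 0 for trees that never use the feature. The tree encoding, the function names, and the feature names are all hypothetical and are not taken from the MRF-IMD software.

```python
from collections import defaultdict


def minimal_depths(tree, depth=0, out=None):
    """Return {feature: shallowest split depth} for a single tree.

    A tree is either None (a leaf) or a tuple
    (split_feature, left_subtree, right_subtree). This encoding is a
    hypothetical stand-in for a fitted tree in a real forest.
    """
    if out is None:
        out = {}
    if tree is None:
        return out
    feature, left, right = tree
    if feature not in out or depth < out[feature]:
        out[feature] = depth
    minimal_depths(left, depth + 1, out)
    minimal_depths(right, depth + 1, out)
    return out


def inverse_minimal_depth(forest):
    """Average 1 / (1 + minimal depth) over all trees (assumed definition).

    Features that a tree never splits on contribute 0 for that tree.
    """
    totals = defaultdict(float)
    for tree in forest:
        for feature, d in minimal_depths(tree).items():
            totals[feature] += 1.0 / (1 + d)
    return {feature: s / len(forest) for feature, s in totals.items()}


# Toy forest with made-up feature names.
forest = [
    ("geneA", ("cpg1", None, None), ("geneB", None, None)),
    ("geneA", ("cpg1", None, None), None),
]
scores = inverse_minimal_depth(forest)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # features ordered from earliest to latest average appearance
```

Under this assumed definition a feature that splits at the root of every tree scores 1.0, and features used only deep in a few trees score close to 0, which is what makes the ranking easy to interpret.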
Title: An Integrative Multi-Omics Random Forest Framework for Robust Biomarker Discovery. Journal: GigaScience. DOI: 10.1093/gigascience/giaf148
Pub Date: 2025-12-09. DOI: 10.1093/gigascience/giaf150
Lars Gruber, Stefan Schmidt, Thomas Enzlein, Carsten Hopf
Spatial 'omics techniques are indispensable for studying complex biological systems and for the discovery of spatial biomarkers. While several current matrix-assisted laser desorption/ionization (MALDI) mass spectrometry imaging (MSI) instruments are capable of localizing numerous metabolites at high spatial and spectral resolution, the majority of MSI data are acquired at the MS1 level only. Assigning molecular identities based on MS1 data presents significant analytical and computational challenges, as the inherent limitations of MS1 data preclude confident annotations beyond the sum formula level. Advancing computational lipid annotation tools requires well-characterized benchmark (ground truth) datasets that go beyond synthetic data or data derived from mimetic tissue models. To this end, we provide two sulfatide-centered, biology-driven magnetic resonance MSI (MR-MSI) datasets at different mass resolving powers that characterize lipids in a mouse model of human metachromatic leukodystrophy. These data include an ultra-high-resolution (R ∼1,230,000) quantum cascade laser mid-infrared imaging-guided MR-MSI dataset that enables isotopic fine structure analysis and therefore substantially increases annotation confidence. To highlight the usefulness of the data, we compared 118 manual sulfatide annotations with the number of decoy database-controlled sulfatide annotations performed in Metaspace (67 at FDR < 10%). Overall, our datasets can be used to benchmark annotation algorithms, validate spatial biomarker discovery pipelines, and serve as a reference for future studies that explore sulfatide metabolism and its spatial regulation.
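To see why a resolving power above one million makes isotopic fine structure accessible, a short back-of-the-envelope calculation may help. The sketch below uses the standard definition R = m / Δm (Δm taken as the peak width at half maximum) and standard isotope mass differences. The example m/z of 800 is an assumption chosen for illustration, and the abstract does not state at which m/z the quoted R was measured, so the numbers only indicate orders of magnitude.

```python
# Standard isotope mass differences in Da (rounded).
MASS_13C_MINUS_12C = 1.0033548
MASS_34S_MINUS_32S = 1.9957958


def fwhm(mz, resolving_power):
    """Peak width (full width at half maximum) implied by R = m / delta_m."""
    return mz / resolving_power


def required_resolving_power(mz, delta_m):
    """Resolving power at which two peaks delta_m apart are one FWHM apart."""
    return mz / abs(delta_m)


# Within the M+2 isotopologue cluster of a sulfur-containing lipid, the
# species carrying one 34S and the species carrying two 13C differ by:
split = 2 * MASS_13C_MINUS_12C - MASS_34S_MINUS_32S

mz = 800.0  # hypothetical m/z, not a value taken from the dataset
width = fwhm(mz, 1_230_000)
needed = required_resolving_power(mz, split)

print(f"34S vs 13C2 split: {split:.4f} Da")
print(f"peak width at R = 1,230,000: {width:.5f} Da")
print(f"R needed to separate the split: {needed:,.0f}")
```

Under these assumptions the peak width is far smaller than the 34S/13C2 spacing, which is consistent with the statement that the ultra-high-resolution dataset supports fine-structure-based confirmation of sulfur-containing species.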
Title: A sulfatide-centered ultra-high resolution magnetic resonance MALDI imaging benchmark dataset for MS1-based lipid annotation tools. Journal: GigaScience. DOI: 10.1093/gigascience/giaf150
Pub Date: 2025-12-08. DOI: 10.1093/gigascience/giaf149
Stephen R Piccolo, Harlan P Stevens
With the increasing complexity and quantity of experimental and observational data, life scientists rely on programming to automate analyses, enhance reproducibility, and facilitate collaboration. Scripting languages like Python are often favored for their simplicity and flexibility, enabling researchers to focus primarily on high-level tasks. Compiled languages such as C++ and Rust offer greater efficiency, making them preferable for intensive or repeated computations. In educational settings, instructors may wish to teach both types of languages and therefore to translate content from one programming language to another. In research contexts, researchers may wish to implement their ideas in one language before translating the code to another. However, translating between programming languages requires significant effort, prompting our interest in using large language models (LLMs) for semi-automated code translation. This study explores the use of an LLM (GPT-4) to translate 559 short-form programming exercises from Python into C++, Rust, Julia, and JavaScript. We used three prompting strategies (instructions only, code only, or both combined) and compared the translated code's output against the Python code's output. Translation success differed considerably by prompting strategy, and at least one of the strategies tested was effective for nearly every exercise. The highest overall success rate occurred for Rust (99.5%), followed by JavaScript (98.9%), C++ (97.9%), and Julia (95.0%). Our findings demonstrate that LLMs can effectively translate small-scale programming exercises between languages, reducing the need for manual rewriting. To support education and research, we have manually translated all exercises that were not translated successfully through automation, and we have made our translations freely available.
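The evaluation described above (compare each translation's output with the Python output, then ask whether at least one prompting strategy worked) can be sketched in a few lines. This is only an illustration: the whitespace normalization, the strategy labels, and the toy results are assumptions made here, not details reported by the study.

```python
STRATEGIES = ("instructions_only", "code_only", "both")  # hypothetical labels


def outputs_match(expected: str, actual: str) -> bool:
    """Compare two program outputs line by line.

    Trailing whitespace and leading/trailing blank lines are ignored; this
    normalization is an assumption, not the study's documented rule.
    """
    def normalize(text):
        return [line.rstrip() for line in text.strip().splitlines()]

    return normalize(expected) == normalize(actual)


def success_rates(results):
    """Summarize {exercise_id: {strategy: passed}} into success rates.

    Returns (per-strategy rate, rate of exercises solved by any strategy).
    """
    n = len(results)
    per_strategy = {
        s: sum(r.get(s, False) for r in results.values()) / n
        for s in STRATEGIES
    }
    any_strategy = sum(any(r.values()) for r in results.values()) / n
    return per_strategy, any_strategy


# Toy results for four made-up exercises.
results = {
    "ex1": {"instructions_only": True, "code_only": True, "both": True},
    "ex2": {"instructions_only": False, "code_only": True, "both": True},
    "ex3": {"instructions_only": False, "code_only": False, "both": True},
    "ex4": {"instructions_only": False, "code_only": False, "both": False},
}
per_strategy, any_strategy = success_rates(results)
print(per_strategy, any_strategy)
```

Reporting the any-strategy rate alongside the per-strategy rates mirrors the abstract's observation that individual strategies vary while their union covers nearly every exercise.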
Title: Translating short-form Python exercises to other programming languages using diverse prompting strategies. Journal: GigaScience. DOI: 10.1093/gigascience/giaf149
Pub Date: 2025-12-05. DOI: 10.1093/gigascience/giaf137
Xiuyun Liu, Fangfang Li, Marek Czosnyka, Zofia Czosnyka, Huijie Yu, Xiaoguang Tong, Yan Xing, Hongliang Li, Ke Pu, Keke Feng, Kuo Zhang, Meijun Pang, Dong Ming
Background: The world has witnessed a steady rise in neurological diseases, which represent a heterogeneous group of disorders characterized by complex pathogenesis involving disruptions at multiple molecular levels, including genomic, transcriptomic, proteomic, and metabolomic levels. These disorders, often caused by genetic mutations, metabolic imbalances, immune dysregulation, and environmental factors, pose significant challenges to global public health due to their high prevalence, mortality, and disability burden.
Results: The advent of high-throughput technologies, such as next-generation sequencing and mass spectrometry, has provided valuable insights into the underlying mechanisms of disease. In particular, the development of multi-omics and high-spatial-resolution omics technologies enables the integration of multiple levels of biology and the analysis of complex molecular networks and pathophysiological processes.
Conclusions: This review provides a comprehensive analysis of the latest advancements in multi-omics and high-spatial-resolution omics, with a focus on their applications in precision diagnostics, biomarker discovery, and therapeutic target identification in brain diseases. It also highlights current challenges in clinical implementation and discusses future directions, in which artificial intelligence is anticipated to significantly enhance clinical translation and diagnostic accuracy.
Title: Multi-Omics and High-Spatial-Resolution Omics: Deciphering Complexity in Neurological Disorders. Journal: GigaScience. DOI: 10.1093/gigascience/giaf137
Pub Date: 2025-12-05. DOI: 10.1093/gigascience/giaf147
Mahnaz Mohammadi, Christina Fell, Sarah Bell, Gareth Bryson, Sheeba Syed, Prakash Konanahalli, David Harris-Birtill, Ognjen Arandjelovic, Clare Orange, Prishma Shahi, In Hwa Um, James D Blackwood, David J Harrison
Background: Whole slide imaging (WSI) enables the digitisation of entire histological slides at high resolution, allowing pathologists and researchers to analyse tissue samples digitally rather than through traditional microscopy. This technology has become increasingly valuable in pathology for research, education, and clinical diagnostics. Endometrial biopsy is very common, often being undertaken to exclude non-cancerous disease. This means that most cases do not contain cancer, and the challenge is to accurately and efficiently exclude serious pathology rather than simply make a diagnosis of malignancy. A well-curated, expert-annotated, endometrial whole slide dataset covering a spread of cancer and non-cancer diagnoses will support machine learning applications in automated diagnosis, facilitate research into the pathology of endometrial cancer, and serve as an educational resource for medical professionals.
Results: We introduce a newly constructed, large-scale dataset of endometrial biopsies, comprising 2,909 whole slide images in iSyntax format, each accompanied by a corresponding annotation file in JSON format. Each whole slide image is labelled with a primary class label representing its final diagnosis and a sub-category label providing further details within that diagnostic class. These class labels are critical for machine learning applications, as they enable the development of AI models capable of distinguishing between different types of endometrial abnormalities, improving automated classification, and guiding clinical decision-making.
Conclusions: Constructing and curating a high-quality endometrial whole slide dataset requires significant effort to ensure accurate annotations, data integrity, and patient privacy protection. However, the availability of a well-annotated dataset with detailed class labels is crucial for advancing digital pathology. Such a resource can enhance diagnostic accuracy, support personalized treatment strategies, and ultimately improve outcomes for patients with endometrial cancer and other endometrial conditions.
Title: Endometrial Whole Slide Images Dataset for Detection of malignancy in endometrial biopsies. Journal: GigaScience. DOI: 10.1093/gigascience/giaf147
Pub Date: 2025-11-29. DOI: 10.1093/gigascience/giaf144
Mahnaz Mohammadi, Christina Fell, David Morrison, Sarah Bell, Gareth Bryson, Sheeba Syed, Prakash Konanahalli, David Harris-Birtill, Ognjen Arandjelovic, Clare Orange, Prishma Shahi, In Hwa Um, James D Blackwood, David J Harrison
The clinical pathway for prevention and treatment of cervical cancer depends on cytology and then on the assessment of biopsies, fragments of tissue removed for histological examination. This can be a significant workload and is an obvious exemplar for exploring triage based on machine learning analysis of slides. Limited access to large annotated datasets of human diseased tissue is a major obstacle to developing standards and algorithms that can assist diagnosis. We present a dataset comprising 2539 whole slide images of cervical biopsies, each annotated by several pathologists, with consensus agreed on diagnosis and individual features. Each whole slide image represents one slide per patient, in iSyntax format, with manual annotations by pathologists in JSON format. Each whole slide image is assigned a category label, which is the final diagnosis of the image, and a subcategory label, which gives the subcategory within that diagnosis. This dataset has been used to build a model that accurately predicts diagnosis, opening the possibility of automatically triaging biopsies so that the most significant pathologies can be identified rapidly and those patients selected for immediate treatment. The level of annotation, at sub-slide level, and the number of cases are unique in public databases and should allow investigators to explore multiple aspects of computer vision relevant to human tissue diagnosis, with no limitation placed on access to the whole slide images.
Title: Cervical Whole Slide Images Dataset for Multi-class Classification. Journal: GigaScience. DOI: 10.1093/gigascience/giaf144
Pub Date: 2025-11-29 DOI: 10.1093/gigascience/giaf145
Joanna Szablińska-Piernik, Paweł Sulima, Jakub Sawicki
Background: The liverwort Apopellia endiviifolia, a dioicous, simple thalloid species, is notable for its cryptic diversity, habitat adaptability, and genomic innovation, and represents a clade that is sister to all other Jungermanniopsida. These features make A. endiviifolia an essential model for exploring speciation mechanisms and the evolution of genome structures within liverworts.
Findings: We present the genome assembly of a haploid A. endiviifolia isolate with a total size of 2,914,960,273 bp and an N50 of 468,157,909 bp, demonstrating high completeness (99.2% BUSCO) and high consensus quality (QV 47.6). The assembly consists of nine chromosomes, which include 18 telomeres and nine centromeres (ranging from 1.9 to 5 Mbp in length). RNA-seq-based annotation identified 34,615 genes, predominantly protein-coding. The transposable elements (TEs) comprised 12.16% LTR elements and 57 Helitrons. Among the retroelements, the Copia and Gypsy superfamilies comprised 8.94% and 2.95% of the genome, respectively. The Ty3/Gypsy superfamily was found to be significantly enriched in centromeric regions. The average GC content ranged from 38.8% to 39.6%, and gene density varied between 5.52 and 9.78 genes per 500 kbp. Synteny analysis with related liverwort species revealed complex chromosomal relationships, indicating extensive genome rearrangements among species.
Conclusions: This study provides the first high-quality reference genome assembly of the haploid liverwort A. endiviifolia. The assembly and annotation offer valuable resources for investigating liverwort evolution, centromere biology, and genome expansion in simple thalloid liverworts.
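For readers less familiar with the assembly statistics quoted in the findings, the following is a minimal sketch of how an N50 is computed and how a Phred-scaled consensus quality value (QV) maps to a per-base error probability. The function names and the toy lengths are illustrative assumptions, not part of the study's pipeline, and the toy lengths are not the real chromosome sizes.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly size."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


# Toy sequence lengths (hypothetical values chosen for illustration only).
toy_lengths = [50, 40, 30, 20, 10]
print(n50(toy_lengths))        # 40: 50 + 40 = 90 covers half of 150
print(qv_to_error_rate(47.6))  # about 1.7e-05
```

Under this standard Phred definition, the reported QV of 47.6 corresponds to roughly one consensus error per 57,000 bases.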
Giant chromosomes of a tiny plant - the complete telomere-to-telomere genome assembly of the simple thalloid liverwort Apopellia endiviifolia (Jungermanniopsida, Marchantiophyta). GigaScience.
Pub Date: 2025-11-29 DOI: 10.1093/gigascience/giaf146
Robert Dahnke, Polona Kalc, Gabriel Ziegler, Julian Grosskreutz, Christian Gaser
Background: The processing and analysis of magnetic resonance images are highly dependent on the quality of the input data, and systematic differences in quality can consequently lead to loss of sensitivity or biased results. However, varying image properties due to different scanners and acquisition protocols, as well as subject-specific image interferences, such as motion artifacts, can be incorporated in the analysis. A reliable assessment of image quality is therefore essential to identify critical outliers that may bias results.
Findings: Here we present a quality assessment for structural (T1-weighted) images using tissue classification in the SPM/CAT12 ecosystem. We introduce multiple image quality measures, standardize them into quality scales, and combine them into an integrated structural image quality rating to facilitate interpretation and the fast identification of outliers with (motion) artifacts. The reliability and robustness of the measures are evaluated using synthetic and real datasets. Our results demonstrate that the proposed measures are robust to simulated segmentation problems and to variables of interest such as cortical atrophy, age, sex, brain size, and severe disease-related changes, and might facilitate the separation of motion artifacts based on within-protocol deviations.
Conclusion: The quality control framework presents a simple but powerful tool for use in research and clinical settings.
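The idea of separating artifacts by within-protocol deviations can be sketched as follows. This is an illustrative approximation only, not the CAT12 implementation: it assumes each scan already has a single integrated quality rating where higher means worse, groups scans by acquisition protocol, and flags scans whose rating lies far above their own protocol's median, using a robust scale based on the median absolute deviation. The function name, tuple layout, threshold, and ratings are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def flag_within_protocol_outliers(scans, threshold=3.0):
    """Flag scans whose rating is far worse than their protocol's median.

    scans: iterable of (scan_id, protocol, rating); higher rating = worse
    (an assumption made for this sketch).
    """
    by_protocol = defaultdict(list)
    for scan_id, protocol, rating in scans:
        by_protocol[protocol].append((scan_id, rating))

    flagged = []
    for items in by_protocol.values():
        ratings = [rating for _, rating in items]
        med = median(ratings)
        mad = median([abs(r - med) for r in ratings])
        scale = 1.4826 * mad  # MAD rescaled to match a normal standard deviation
        if scale == 0:
            continue  # no spread within this protocol; nothing to compare against
        for scan_id, rating in items:
            if (rating - med) / scale > threshold:
                flagged.append(scan_id)
    return flagged


# Hypothetical ratings for two protocols with different baseline quality.
scans = [
    ("A1", "A", 2.0), ("A2", "A", 2.1), ("A3", "A", 1.9),
    ("A4", "A", 2.2), ("A5", "A", 4.5),
    ("B1", "B", 3.0), ("B2", "B", 3.1), ("B3", "B", 2.9),
    ("B4", "B", 3.2), ("B5", "B", 3.05),
]
print(flag_within_protocol_outliers(scans))  # ['A5']
```

Comparing each scan against its own protocol means that a rating of 3.2 is unremarkable in protocol B but would stand out in protocol A, which is why a single global cut-off is avoided in this sketch.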
Segmentation-Based Quality Control of Structural MRI using the CAT12 Toolbox. GigaScience.
Background: Bactrocera tsuneonis is a major pest of citrus, causing significant economic losses in fruit production. It exhibits a highly specialized host preference, primarily infesting citrus fruits. However, the genetic basis underlying its olfactory adaptation and host specificity remains largely unexplored. To elucidate the molecular mechanisms governing host selection in B. tsuneonis, we assembled a high-quality chromosome-level genome and performed comparative genomic, transcriptomic, and functional analyses of its chemosensory system.
Results: The genome of B. tsuneonis was assembled to a total size of 339 Mb, with a contig N50 of 11.21 Mb and a scaffold N50 of 59.93 Mb. Comparative genomic analysis revealed significant contractions in chemosensory-related gene families, particularly in odorant-binding proteins (OBPs) and odorant receptors (ORs), possibly reflecting adaptation to a narrow host range. Transcriptome analysis demonstrated that BtsuOBP83a and BtsuOBP83b were highly expressed in the antennae, and most ORs were likewise predominantly expressed there. Functional assays confirmed that BtsuOBP83a selectively binds two citrus volatiles, trans-nerolidol and piperitone, with strong affinity. Molecular docking and molecular dynamics simulations further revealed that BtsuOr7a-6 and BtsuOr7a-4 specifically interact with these volatiles, suggesting their role in host odor recognition.
Conclusions: Our high-quality genome of B. tsuneonis provides a valuable resource for genomic research and offers insights into the genetic basis of its olfactory adaptation and host specificity. The findings highlight key molecular mechanisms underlying host selection and provide potential targets for behavior-based pest management strategies.
A high-quality chromosome-level genome assembly of the oligophagous fruit fly Bactrocera tsuneonis (Diptera: Tephritidae) and insights into its host specificity. GigaScience.
Pub Date: 2025-11-20 DOI: 10.1093/gigascience/giaf143
Tengda Guo, Weisong Li, Yuan Zhang, Wenzhao Yang, Zhihong Li, Yujia Qin