Pub Date: 2026-01-21; DOI: 10.1093/gigascience/giaf145
Joanna Szablińska-Piernik, Paweł Sulima, Jakub Sawicki
Background: The liverwort Apopellia endiviifolia, a dioicous, simple thalloid species, is notable for its cryptic diversity, habitat adaptability, and genomic innovation, and it represents a clade that is sister to all other Jungermanniopsida. These features make A. endiviifolia an essential model for exploring speciation mechanisms and the evolution of genome structures within liverworts.
Findings: We present the genome assembly of a haploid A. endiviifolia isolate with a total size of 2,914,960,273 bp and an N50 of 468,157,909 bp, demonstrating high completeness (99.2% BUSCO) and a high consensus quality (quality value 47.6). The assembly consisted of 9 chromosomes, which included 18 telomeres and 9 centromeres (ranging from 1.9 to 5 Mbp in length). RNA sequencing-based annotation identified 34,615 genes, predominantly protein coding. The transposable elements comprised 12.16% long terminal repeat elements and 57 Helitrons. Among the retroelements, the Copia and Gypsy superfamilies comprised 8.94% and 2.95% of the genome, respectively. The Ty3/Gypsy superfamily was significantly enriched in centromeric regions. The average GC content ranged from 38.8% to 39.6%, with gene density varying between 5.52 and 9.78 genes per 500 kbp. Synteny analysis of related liverwort species has revealed complex chromosomal relationships, indicating extensive genome rearrangements among species.
Conclusions: This study provides the first high-quality reference genome assembly of the haploid liverwort A. endiviifolia. Assembly and annotation offer valuable resources for investigating liverwort evolution, centromere biology, and genome expansion in simple thalloid liverworts.
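The N50 figure quoted in the findings can be illustrated with a short calculation. The sketch below is a generic N50 routine, not the authors' tooling, and the lengths in the usage line are invented toy values rather than the real chromosome sizes of this assembly.

```python
def n50(lengths):
    """Return the N50 of a set of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly size.
    """
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy lengths (hypothetical, for illustration only).
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # prints 8
```

With only 9 chromosomes in an assembly, an N50 close to the size of the larger chromosomes is expected, because a handful of sequences already covers half of the total.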
Title: Giant chromosomes of a tiny plant - the complete telomere-to-telomere genome assembly of the simple thalloid liverwort Apopellia endiviifolia (Jungermanniopsida, Marchantiophyta).
Journal: GigaScience. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12885004/pdf/
Pub Date: 2026-01-21; DOI: 10.1093/gigascience/giag009
Evangelos Karatzas, Martin Beracochea, Fotis A Baltoumas, Eleni Aplakidou, Lorna Richardson, James A Fellows Yates, Daniel Lundin
The growth of metagenomics-derived amino acid sequence data has transformed our understanding of protein function, microbial diversity, and evolutionary relationships. However, the vast majority of these proteins remain functionally uncharacterized. Grouping the millions of such uncharacterized sequences with the few experimentally characterized ones allows the transfer of annotations, while inspecting conserved residues in multiple sequence alignments can provide clues to function, even in the absence of existing functional information. To address the challenges associated with this data surge and the need to group sequences, we present a scalable, open-source, parametrizable Nextflow pipeline (nf-core/proteinfamilies) that generates nascent protein families or assigns new proteins to existing families. The computational benchmarks demonstrated that resource usage scales approximately linearly with input size, and the biological benchmarks showed that the generated protein families closely resemble manually curated families in widely used databases.
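One way to check a claim of approximately linear scaling is to fit a straight line to benchmark measurements and look at the goodness of fit. The sketch below is only an illustration of that idea: `fit_line` is a hypothetical helper, and the input sizes and CPU hours are made-up numbers, not results from the nf-core/proteinfamilies benchmarks.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, plus R^2."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 compares residual variation to total variation in y.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


# Invented benchmark points (hypothetical, for illustration only).
sequences_millions = [1, 2, 4, 8]
cpu_hours = [3.1, 5.9, 12.2, 23.8]
slope, intercept, r2 = fit_line(sequences_millions, cpu_hours)
print(f"slope={slope:.2f} CPU h per million sequences, R^2={r2:.4f}")
```

An R^2 near 1 with a small intercept is consistent with linear scaling over the measured range; it says nothing about behavior outside that range.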
Title: nf-core/proteinfamilies: A scalable pipeline for the generation of protein families.
Journal: GigaScience.
Pub Date: 2026-01-21; DOI: 10.1093/gigascience/giaf141
Ian Tsang, Lawrence Percival-Alwyn, Stephen Rawsthorne, James Cockram, Fiona Leigh, Jonathan A Atkinson
Background: Root hairs play a key role in plant nutrient and water uptake. Historically, root hair traits have largely been quantified manually. As such, this process has been laborious and low-throughput. However, given their importance for plant health and development, high-throughput quantification of root hair morphology could help underpin rapid advances in the genetic understanding of these traits. With recent increases in the accessibility and availability of artificial intelligence (AI) and machine learning techniques, the development of tools to automate plant phenotyping processes has been greatly accelerated.
Results: We present pyRootHair, a high-throughput, AI-powered software application to automate root hair trait extraction from microscope images of plant roots grown on agar plates. pyRootHair is capable of batch processing over 600 images per hour without manual input from the end user. In this study, we deploy pyRootHair on a panel of 24 diverse wheat (Triticum aestivum and Triticum turgidum ssp. durum) cultivars and uncover a large, previously unresolved amount of variation in many root hair traits. We show that the overall root hair profile falls under 2 distinct shape categories and that different root hair traits often correlate with each other. We also demonstrate that pyRootHair can be deployed on a range of plant species, including oat (Avena sativa), rice (Oryza sativa), teff (Eragrostis tef), and tomato (Solanum lycopersicum).
Conclusions: The application of pyRootHair enables users to rapidly screen a large number of plant germplasm resources for variation in root hair morphology, supporting high-resolution measurements and high-throughput data analysis. This facilitates downstream investigation of the impacts of root hair genetic control and morphological variation on plant performance. pyRootHair is installable via PyPI (https://pypi.org/project/pyRootHair/) and can be accessed on GitHub at https://github.com/iantsang779/pyRootHair.
Title: pyRootHair: Machine learning accelerated software for high-throughput phenotyping of plant root hair traits.
Journal: GigaScience. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12824728/pdf/
Accurate genetic variant interpretation is crucial for disease research and the development of targeted therapies. Artificial intelligence is transforming this field by integrating computational methodologies across structural biology, evolutionary analysis, and multimodal genomic data. This review examines the evolution from traditional rule-based systems and statistical models to contemporary machine learning, deep learning, and protein language models, while addressing critical challenges in variant classification. Key obstacles include data heterogeneity, interpretability, and the persistence of variants of uncertain significance, emphasizing the critical need for explainable artificial intelligence frameworks and more inclusive genomic databases to improve predictive accuracy across diverse populations. Based on the assessment of current variant impact predictors, we propose strategies for enhanced predictor selection, effective multi-omics data integration, and optimized computational workflows. These recommendations aim to enhance variant interpretation accuracy in both research settings and clinical practice, ultimately contributing to advances in personalized medicine.
Title: Harnessing artificial intelligence for genomic variant prediction: advances, challenges, and future directions.
Authors: Indah Pakpahan, Mentari Sihombing, Haohan Liu, Mengyao Wang, Zheng Su, Mingyan Fang
Pub Date: 2026-01-21; DOI: 10.1093/gigascience/giag004
Journal: GigaScience.
Pub Date: 2026-01-21; DOI: 10.1093/gigascience/giaf161
Joseph E Powell
Large-scale single-cell genomics projects have revolutionized our understanding of human immune variation. Yet most studies to date have been Eurocentric, limited in cell-type resolution, or restricted to a single data modality. The newly published Chinese Immune Multi-Omics Atlas helps address these gaps by profiling 428 healthy Chinese adults using a multiomics single-cell approach that combines single-cell RNA sequencing and single-cell chromatin accessibility sequencing across over 10 million immune cells. This integrated strategy enabled the identification of 73 distinct immune cell subsets and the construction of cell-type-specific gene regulatory networks linking noncoding enhancers to target genes. The atlas delineated hundreds of enhancer modules (eRegulons), highlighting both established and novel regulators of immune cell identity. By aligning transcriptomic and epigenomic maps, Yin et al. show how expanding both the ancestral diversity and data modalities of immune cell genomics can reveal new biology and provide a valuable addition to global reference cell atlases.
Title: Charting immune variation through genetics and single-cell genomics.
Journal: GigaScience. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12821369/pdf/
Pub Date: 2026-01-21; DOI: 10.1093/gigascience/giaf133
Cenny Taslim, Yuan Zhang, Galen Rask, Genevieve C Kendall, Emily R Theisen
Background: RNA sequencing (RNA-seq) analysis has become a routine task in numerous genomic research labs, driven by the reduced cost of bulk RNA sequencing experiments. These studies generate billions of reads that require easy-to-run, comprehensive, and reproducible analysis. However, many labs rely on in-house scripts, which can be challenging for bench scientists to use and hinder standardization and reproducibility. While existing RNA-seq pipelines attempt to address these challenges, they often lack a complete end-to-end user interface.
Findings: To bridge this gap, we developed RNA-SeqEZPZ, an automated pipeline with a user-friendly point-and-click interface, enabling rigorous and reproducible RNA-seq analysis without requiring programming or bioinformatics expertise. For advanced users, the pipeline can also be executed from the command line, allowing customization of steps to suit specific applications. The innovation of this pipeline lies in the combination of 3 key features: (i) all software is packaged within a Singularity container, eliminating installation issues; (ii) it offers a graphical, point-and-click interface from raw FASTQ files through differential expression and pathway analysis; and (iii) it includes a Nextflow implementation, enabling scalability and portability for seamless execution across various platforms, including job submission in the cloud and cluster computing. Additionally, RNA-SeqEZPZ generates a comprehensive statistical report and offers an option for batch adjustment to minimize effects of noise due to technical variation across replicates. Reports can also be reviewed by a bioinformatician to ensure the overall quality of the analysis.
Conclusions: RNA-SeqEZPZ is a robust, accessible, and scalable solution for comprehensive RNA-seq analysis, enabling researchers to focus on biological insights rather than computational challenges.
Title: RNA-SeqEZPZ: a point-and-click pipeline for comprehensive transcriptomics analysis with interactive visualizations.
Journal: GigaScience. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12857227/pdf/
Pub Date: 2026-01-21; DOI: 10.1093/gigascience/giag008
Tomasz M Kowalski
Background: FASTA is the primary format for representing DNA, RNA and protein sequences. While progress has been made in specialized FASTA collection compressors, they still struggle with practical limitations and inconsistent performance across different datasets, hindering effective storage and transfer of large genomic datasets.
Results: We present an enhanced version of the Multiple Bacteria Genome Compressor (MBGC), a high-throughput, in-memory algorithm for compressing genome collections. It relies on information about maximum exact matches in the compressed set to identify possibly long approximate matches. It encodes them even when they partially overlap, boosting the compression ratio by an average of 14% across bacterial datasets, while the reengineered multi-threaded decoding speeds up decompression compared to its predecessor by around 40%. The compression ratio improvement is even more pronounced on other collections, for H. sapiens reaching 18%, and up to 55% for S. paradoxus. MBGC2 performs consistently across diverse datasets and introduces practical features to ease data management such as archive appending, repacking, fast content listing and flexible decompression options. Benchmark tests covering nucleotide-based bacterial, viral, and human genome collections show that MBGC2 combines compression efficiency and processing speed. The tool supports working with single genomes or amino acid collections, but does not guarantee such high performance in these cases.
Conclusions: MBGC2 addresses critical limitations in genome collection compression by delivering reliable performance, improved compression ratios, and enhanced usability features. The consistent efficiency across diverse genomic datasets makes it a versatile tool for managing the growing volume of genomic data in research and clinical settings. The balance between compression ratio and speed positions MBGC2 as a practical solution for the storage and transfer of large genomic collections.
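The idea of exact matches that cannot be extended, which the results above use as the starting point for finding approximate matches, can be sketched with a simple seed-and-extend search. This is a textbook-style illustration and makes no claim about how MBGC2 is implemented; the function name, the seed length `k`, and the toy sequences are all assumptions introduced here.

```python
from collections import defaultdict


def maximal_exact_matches(ref, query, k=4):
    """Find exact matches of length >= k that cannot be extended.

    Returns a sorted list of (ref_start, query_start, length) tuples.
    """
    # Index every k-mer of the reference by its start position.
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)

    found = set()
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], ()):
            # Extend the seed to the left while characters agree.
            a, b = i, j
            while a > 0 and b > 0 and ref[a - 1] == query[b - 1]:
                a -= 1
                b -= 1
            # Extend the seed to the right while characters agree.
            end_r, end_q = i + k, j + k
            while end_r < len(ref) and end_q < len(query) and ref[end_r] == query[end_q]:
                end_r += 1
                end_q += 1
            # Seeds inside the same match collapse to one tuple in the set.
            found.add((a, b, end_r - a))
    return sorted(found)


# Toy sequences (hypothetical, for illustration only).
print(maximal_exact_matches("ACGTACGTTT", "GGACGTTTCC"))
# prints [(0, 2, 4), (4, 2, 6)]
```

A compressor can store a match as a (position, length) reference instead of the literal bases, which is why longer and more numerous matches tend to improve the compression ratio on collections of similar genomes.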
Title: MBGC2: Boosting compression via efficient encoding of approximate matches in genome collections.
Journal: GigaScience.
Background: Safflower (Carthamus tinctorius L.) is a drought-resilient oilseed crop. Besides producing edible oil rich in oleic and linoleic acids, it is also used in biofuels, cosmetics, coloring dyes, pharmaceuticals, and nutraceuticals. Despite its significant economic uses, the availability of genetic and genomic resources in safflower is limited.
Results: We report an improved de novo genome assembly of safflower (Safflower_A2). A chromosome-level assembly of 1.15 Gb with telomeres and centromeric repeats was constructed using PacBio HiFi reads, optical maps, Illumina short reads, and Hi-C sequencing. Safflower_A2 shows better contiguity, completeness, and high-quality annotation than previous assemblies. The assembly was further validated with the help of a single-nucleotide polymorphism (SNP)-based linkage map. A genome-wide survey identified genes for comprehensive exploration of disease resistance in the safflower. Employing the de novo genome assembly as a reference, we used resequencing data of a global core collection of 123 accessions to carry out an SNP-based genome-wide association study, which identified significant associations for several traits and their haplotypes of agronomic value, including seed oil content. Resequencing data were also applied for a pan-genome analysis, which provided critical insights into genome diversity, identifying an additional ~11,000 genes and their functional enrichment that will be useful for region-specific breeding lines.
Conclusion: Our study provides insights into the genomic architecture of safflower by leveraging an improved genome assembly and annotation. Additionally, the high-density linkage map, marker-trait associations, and pan-genome developed in this study provide valuable resources for breeding and crop improvement programs by the global research community.
Title: Improved reference assembly and core collection resequencing to facilitate exploration of important agronomical traits for the improvement of oilseed crop, Carthamus tinctorius L.
Authors: Megha Sharma, Varun Bhardwaj, Praveen Kumar Oraon, Shivani Choudhary, Heena Ambreen, Rohit Nandan Shukla, Harsha Rayudu Jamedar, Ajitha Vijjeswarapu, Vandana Jaiswal, Palchamy Kadirvel, Arun Jagannath, Shailendra Goel
Pub Date: 2026-01-21; DOI: 10.1093/gigascience/giaf151
Journal: GigaScience.
Pub Date : 2026-01-20 DOI: 10.1093/gigascience/giag001
Gina Turco
The Pacific Symposium on Biocomputing (PSB) recognized my work with the 2024 Junior Research Parasite Award, an honor established to highlight the scientific value of reanalyzing, integrating, and reinterpreting existing datasets. The award invites recipients to reflect on the role of research parasites within the broader ecosystem of computational biology and data reuse. For me, this perspective is rooted in years of working across diverse -omics datasets, where I've seen firsthand how the structure, resolution, and context of a dataset shape the biological insight it can support. Rather than focusing on data volume alone, meaningful discovery often emerges from understanding what each dataset can, and cannot, reveal. Here, I outline how different modes of secondary analysis, from integrating complementary datasets to deeply mining a single omics layer.
{"title":"Beyond Volume and Toward Coherence: A research parasite's perspective.","authors":"Gina Turco","doi":"10.1093/gigascience/giag001","DOIUrl":"https://doi.org/10.1093/gigascience/giag001","url":null,"abstract":"<p><p>The Pacific Symposium on Biocomputing (PSB) recognized my work with the 2024 Junior Research Parasite Award, an honor established to highlight the scientific value of reanalyzing, integrating, and reinterpreting existing datasets. The award invites recipients to reflect on the role of research parasites within the broader ecosystem of computational biology and data reuse. For me, this perspective is rooted in years of working across diverse -omics datasets, where I've seen firsthand how the structure, resolution, and context of a dataset shape the biological insight it can support. Rather than focusing on data volume alone, meaningful discovery often emerges from understanding what each dataset can-and cannot-reveal. Here, I outline how different modes of secondary analysis, from integrating complementary datasets to deeply mining a single omics layer.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":" ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2026-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146009879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-20 DOI: 10.1093/gigascience/giag006
Souvik Seal, Brian Neelon
Advances in spatial omics enable measurement of genes (spatial transcriptomics) and peptides, lipids, or N-glycans (mass spectrometry imaging) across thousands of locations within a tissue. While detecting spatially variable molecules is a well-studied problem, robust methods for identifying spatially varying co-expression between molecule pairs remain limited. We introduce SpaceBF, a Bayesian fused modeling framework that estimates co-expression at both local (location-specific) and global (tissue-wide) levels. SpaceBF enforces spatial smoothness via a fused horseshoe prior on the edges of a predefined spatial adjacency graph, allowing large, edge-specific differences to escape shrinkage while preserving overall structure. In extensive simulations, SpaceBF achieves higher specificity and power than commonly used methods that leverage geospatial metrics, including bivariate Moran's I and Lee's L. We also benchmark the proposed prior against standard alternatives, such as intrinsic conditional autoregressive (ICAR) and Matérn priors. Applied to spatial transcriptomics and proteomics datasets, SpaceBF reveals cancer-relevant molecular interactions and patterns of cell-cell communication (e.g., ligand-receptor signaling), demonstrating its utility for principled, uncertainty-aware co-expression analysis of spatial omics data.
{"title":"SpaceBF: Spatial coexpression analysis using Bayesian Fused approaches in spatial omics datasets.","authors":"Souvik Seal, Brian Neelon","doi":"10.1093/gigascience/giag006","DOIUrl":"10.1093/gigascience/giag006","url":null,"abstract":"<p><p>Advances in spatial omics enable measurement of genes (spatial transcriptomics) and peptides, lipids, or N-glycans (mass spectrometry imaging) across thousands of locations within a tissue. While detecting spatially variable molecules is a well-studied problem, robust methods for identifying spatially varying co-expression between molecule pairs remain limited. We introduce SpaceBF, a Bayesian fused modeling framework that estimates co-expression at both local (location-specific) and global (tissue-wide) levels. SpaceBF enforces spatial smoothness via a fused horseshoe prior on the edges of a predefined spatial adjacency graph, allowing large, edge-specific differences to escape shrinkage while preserving overall structure. In extensive simulations, SpaceBF achieves higher specificity and power than commonly used methods that leverage geospatial metrics, including bivariate Moran's I and Lee's L. We also benchmark the proposed prior against standard alternatives, such as intrinsic conditional autoregressive (ICAR) and Matérn priors. 
Applied to spatial transcriptomics and proteomics datasets, SpaceBF reveals cancer-relevant molecular interactions and patterns of cell-cell communication (e.g., ligand-receptor signaling), demonstrating its utility for principled, uncertainty-aware co-expression analysis of spatial omics data.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":" ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2026-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146009819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}