Pub Date: 2026-01-20; DOI: 10.1093/gigascience/giag003
Melanie Segado, Laura A Prosser, Andrea F Duncan, Michelle J Johnson, Konrad P Kording
Cerebral palsy (CP), which affects approximately 1 in 500 children and results from abnormal brain development, impairs movement control. Early risk assessment via the General Movements Assessment (GMA) at 3-4 months is highly predictive of CP but relies on trained clinicians. Machine-learning-based approaches for predicting GMA score from video have shown considerable promise, but they typically rely on dataset-specific preprocessing, custom feature sets, and manually designed model pipelines, which make external benchmarking more difficult. This, combined with strict privacy constraints on sharing data, makes it challenging to train and evaluate models across datasets, which is important for assessing clinical utility. There is therefore a need for approaches that work across different datasets to enable multi-site dataset aggregation and model training. To address this gap, we developed an end-to-end pipeline that uses off-the-shelf pose estimation, general-purpose feature extraction, and automated machine learning, none of which are tuned to a specific dataset. We applied this approach to a newly generated large dataset of 1053 infants (with approximately 10-12% positive class for adverse GMA outcome, drawn from a high-risk clinical cohort) within a preregistered study design. Model performance was evaluated on a strict "lock-box" test set, which remained untouched during every phase of model development and preprocessing optimization and was used for evaluation only once the final model and pipeline had been preregistered. The developed model achieved moderate predictive accuracy for clinician-assessed GMA scores (Area Under the Receiver Operating Characteristic Curve, ROC-AUC = 0.77; Area Under the Precision-Recall Curve, PR-AUC = 0.41). This moderate accuracy is noteworthy given the 10-12% positive class prevalence and the power-law scaling of ROC-AUC with increasing dataset size. 
By releasing de-identified feature data and open-source code, and simplifying the training pipeline using AutoML, our work establishes essential groundwork for future robust, globally relevant CP screening tools suitable for low-resource settings.
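To make the two reported metrics concrete, here is a minimal sketch of how ROC-AUC and PR-AUC (computed here as average precision) behave under class imbalance. The labels and scores are invented toy values with a 10% positive class; they are not the study's data, and the function names are illustrative rather than taken from the released code.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (assumes distinct scores)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at the rank of each positive
    return total / n_pos


# Toy data (hypothetical): 20 infants, 2 with an adverse outcome (10% prevalence).
labels = [0, 1, 0, 0, 1] + [0] * 15
scores = [(20 - i) / 20 for i in range(20)]  # strictly decreasing, no ties
prevalence = sum(labels) / len(labels)

print("ROC-AUC:", round(roc_auc(labels, scores), 3))           # 0.889
print("PR-AUC :", round(average_precision(labels, scores), 3))  # 0.45
print("Chance-level PR-AUC is roughly the prevalence:", prevalence)
```

The point of the comparison is that a chance-level ROC-AUC is 0.5 regardless of prevalence, whereas a chance-level PR-AUC is roughly the positive-class prevalence, so a PR-AUC of 0.41 at 10-12% prevalence sits well above its baseline.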
Title: A preregistered, open pipeline for early cerebral palsy risk assessment from Infant Videos. Journal: GigaScience.
Pub Date: 2026-01-19; DOI: 10.1093/gigascience/giaf156
Diane Duroux, Paul P Meyer, Giovanni Visoná, Niko Beerenwinkel
Background: The deployment of machine learning in clinical settings is often hindered by the limited generalizability of the models. Models that perform well during development tend to underperform in new environments, limiting their clinical utility. This issue affects models designed for the rapid identification of antimicrobial resistance, which is essential to guide treatment decisions. Traditional susceptibility tests can take up to three days, whereas integrating MALDI-TOF mass spectrometry with machine learning has the potential to reduce this to one day. However, model performance declines drastically in hospitals or time frames outside the training data.
Results: To improve robustness, we develop advanced feature representations using masked autoencoders (MAE) for MALDI-TOF spectra, and chemical language models and SELF-referencing embedded strings (SELFIES) for antimicrobials. Cross-validated on data from four medical institutions, our models demonstrate improved performance and stability. The MAE and SELFIES encodings increase the area under the precision-recall curve by 4% when evaluated on unseen time periods, while the MAE and Molformer language model encodings improve it by 10% when applied across different hospitals.
Conclusions: These results underscore the value of combining deep learning with chemical and spectral information to build generalizable, high-impact clinical AI.
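Evaluation "across different hospitals" is commonly set up as a leave-one-site-out split, in which every sample from one institution is held out at a time. The sketch below illustrates that idea only; the site labels are hypothetical, and the abstract does not state that the authors split their data exactly this way.

```python
from collections import defaultdict


def leave_one_site_out(site_labels):
    """Yield (held_out_site, train_indices, test_indices), one split per site."""
    by_site = defaultdict(list)
    for idx, site in enumerate(site_labels):
        by_site[site].append(idx)
    for site in sorted(by_site):
        test_idx = by_site[site]
        # Train on every sample that does not come from the held-out site.
        train_idx = sorted(i for s, idxs in by_site.items() if s != site for i in idxs)
        yield site, train_idx, test_idx


# Hypothetical site label for each spectrum (four institutions, as in the abstract).
sites = ["A", "A", "B", "C", "C", "D"]
splits = list(leave_one_site_out(sites))
for held_out, train_idx, test_idx in splits:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

A split on unseen time periods can be built the same way by grouping on a collection-period label instead of a site label.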
Title: Generalizable machine learning models for rapid antimicrobial resistance prediction in unseen healthcare settings. Journal: GigaScience.
Pub Date: 2026-01-19; DOI: 10.1093/gigascience/giag005
Jannes Spangenberg, Christian Höner Zu Siederdissen, Winfried Goettsch, Lennart Köhler, Liz Maria Luke, Kai Papenfort, Manja Marz
Background: Oxford Nanopore Technologies (ONT) sequencing enables direct, long-read sequencing of DNA and RNA, preserving nucleotide modifications. During basecalling, deep neural networks translate raw nanopore signals into nucleotide sequences, internally segmenting the signal to align it with the corresponding bases. This is a challenging task due to uneven motor protein rotation, signal variability, low-quality reads, and the presence of nucleotide modifications. However, the signal-to-nucleotide assignment is critical for novel downstream signal analysis. Existing tools, such as Tombo Resquiggle, f5c Eventalign, f5c Resquiggle, and Uncalled4, operate after basecalling and rely on event-based segmentation and mapping approaches that often fail to align low-quality or modified reads and lack confidence estimates for segmentation accuracy.
Results: Here, we present a large-scale comparative study in which 5 segmentation tools, including our novel tool Dynamont, are applied to 16 ONT-sequenced data sets spanning different kingdoms of life. Overall, we segmented 160 000 reads and evaluated the tools' performance on a combination of 12 signal and downstream assembly metrics. Our study is accompanied by a comprehensive and extensible Supplement that summarizes all data sets, execution instructions, and evaluation results. We score the segmentation results using an aggregated metric score computed from all analysed metrics.
Conclusions: No tool delivered the best results for all data sets. We recommend a careful choice and normalization of evaluation metrics to select the best segmentation tool as a critical step in the process of ONT signal segmentation. Across nearly all RNA data sets, Dynamont outperforms other segmentation tools in terms of aggregated metric scores. For DNA data sets, however, the performance is more variable, with mixed results observed across tools.
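One simple way to combine heterogeneous metrics into an aggregated score is to min-max normalize each metric across tools, flip the metrics where lower is better, and average. The sketch below assumes that scheme purely for illustration; the abstract does not specify the actual aggregation, and the tool names, metric names, and values are hypothetical.

```python
def aggregate_scores(results, higher_is_better):
    """results: {tool: {metric: value}} -> {tool: mean normalized score in [0, 1]}."""
    tools = list(results)
    metrics = list(higher_is_better)
    totals = {t: 0.0 for t in tools}
    for m in metrics:
        values = [results[t][m] for t in tools]
        lo, hi = min(values), max(values)
        for t in tools:
            if hi == lo:
                norm = 1.0  # every tool ties on this metric
            else:
                norm = (results[t][m] - lo) / (hi - lo)
                if not higher_is_better[m]:
                    norm = 1.0 - norm  # flip so that 1.0 is always best
            totals[t] += norm
    return {t: totals[t] / len(metrics) for t in tools}


# Hypothetical values for three placeholder tools on two placeholder metrics.
results = {
    "tool_a": {"aligned_fraction": 0.90, "signal_error": 1.5},
    "tool_b": {"aligned_fraction": 0.80, "signal_error": 1.0},
    "tool_c": {"aligned_fraction": 0.70, "signal_error": 3.0},
}
direction = {"aligned_fraction": True, "signal_error": False}
scores = aggregate_scores(results, direction)
best_tool = max(scores, key=scores.get)
print(scores, "best:", best_tool)
```

Because min-max scaling is relative to the tools being compared, adding or removing a tool can change every score, which is one reason the choice of normalization matters when ranking tools.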
Title: Dynamont: A comprehensive cross-species comparison of ONT segmentation tools. Journal: GigaScience.
Pub Date: 2026-01-16; DOI: 10.1093/gigascience/giag002
Xiaotong Yang, Wenting Liu, Zhixin Mao, Yuheng Du, Cameron Lassiter, Fadhl M AlAkwaa, Paula A Benny, Lana X Garmire
Background: Preeclampsia is a severe pregnancy complication that threatens maternal and neonatal health and well-being. Previous studies on epigenome-wide association analysis (EWAS) of preeclampsia produced inconsistent results in cord blood tissues, and one possible explanation is their failure to rigorously adjust for cell proportions, gestational age, or other necessary variables.
Methods: Here, we calculated the DNA methylation change in cord blood from newborns affected by preeclampsia, using a multi-ethnic cohort from the Hawaii population (24 cases, 38 controls). We comprehensively adjusted for variables such as maternal age, body mass index (BMI), parity, and estimated cell proportions. We also re-analyzed two previous datasets, adjusting for estimated cell proportions, and conducted a pooled analysis by merging all three datasets to increase the statistical power (58 cases, 71 controls). Lastly, we included idiopathic preterm (preterm delivery with no known reasons) cord blood samples (n=11) to disentangle the effect of severe preeclampsia and small gestational age.
Results: We showed that after adjusting cell type proportions and patient clinical characteristics, most of the so-called statistically significant CpG methylation changes associated with severe preeclampsia disappeared in our own data, two public datasets, and the pooled analysis combining all three datasets. This result still holds after including idiopathic preterm samples in the control group. Rather, we found that gestation progression is accompanied by statistically significant proportion changes in several cell types, such as granulocytes, nRBCs, CD8Ts, and B cells, which contribute to most DNA methylation differences between case and control groups. Preeclampsia has interactions on cell proportion changes in granulocytes, monocytes, and nRBCs.
Conclusions: In summary, our study shows that the previously reported differentially methylated patterns in cord blood are artifacts of not properly adjusting for cell type heterogeneity, gestational age, and clinical covariates. Severe preeclampsia is not associated with statistically significant DNA methylation changes but with changes in cell proportion. This finding highlights the scientific rigor needed in EWAS.
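The confounding mechanism described above can be illustrated with a small regression. In this sketch the data are simulated so that methylation depends only on a cell-type proportion, which happens to be higher in cases. It is a toy example under that assumption, not the study's data or its actual statistical model.

```python
def ols(columns, y):
    """Least-squares coefficients for y ~ intercept + columns (normal equations)."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    # Augmented system [X'X | X'y].
    aug = []
    for i in range(k):
        lhs = [sum(r[i] * r[j] for r in rows) for j in range(k)]
        rhs = sum(r[i] * yi for r, yi in zip(rows, y))
        aug.append(lhs + [rhs])
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot_row = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot_row] = aug[pivot_row], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(k):
            if r != c:
                factor = aug[r][c]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[c])]
    return [aug[i][k] for i in range(k)]


# Simulated toy data: 4 controls, 4 cases; cases carry a higher cell proportion.
case = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
cell_prop = [0.10, 0.12, 0.14, 0.16, 0.20, 0.22, 0.24, 0.26]
# Methylation is driven by cell proportion only; there is no direct case effect.
methylation = [0.30 + 1.0 * c for c in cell_prop]

unadjusted = ols([case], methylation)            # [intercept, case effect]
adjusted = ols([case, cell_prop], methylation)   # [intercept, case effect, cell effect]
print("case effect, unadjusted:", round(unadjusted[1], 3))
print("case effect, adjusted  :", round(adjusted[1], 3))
```

Without the covariate, the case coefficient absorbs the cell-proportion difference between groups; once the proportion is in the model, the apparent case effect vanishes.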
Title: Cord blood DNA methylation and cell type composition are not significantly associated with severe preeclampsia, after cell type and clinical covariate adjustment. Journal: GigaScience.
The genomes of mangrove Acanthus species have not been reported, despite their ecological and medicinal importance. Using PacBio and Hi-C data, we generated a chromosome-scale genome assembly of the recently identified allotetraploid species Acanthus tetraploideus (2n = 96). The genomes of diploid progenitors, Acanthus ilicifolius and Acanthus ebracteatus (2n = 48), were assembled from stLFR data. We identified an Acanthus-specific whole-genome duplication (WGD) event that occurred ∼43 million years ago (Mya). Ancestral karyotype reconstruction revealed a shift in haploid chromosome number from 11 to 24 in the progenitors, following the WGD and subsequent chromosomal fission events. The hybridization that formed A. tetraploideus was estimated to have occurred 0.7-1.8 Mya. Phylogenomic and synteny analyses clearly showed that A. tetraploideus inherited subgenomes SG1 and SG2 from A. ilicifolius and A. ebracteatus, respectively. Gene structure and retention analyses revealed a smaller and more structurally flexible genome in A. ebracteatus and SG2 compared with A. ilicifolius and SG1. Gene family and machine learning analyses identified expansions in protein families related to Casparian strip formation, root development, and salt stress response. Several of these families were expanded in A. ilicifolius and SG1 but contracted in A. ebracteatus and SG2. These genomic patterns might have contributed to the establishment of A. tetraploideus within the habitat of A. ebracteatus. For all three species, population analysis revealed clear genetic divergence between samples from the eastern and western coasts of Thailand. This study provides valuable genomic resources and insights into the evolutionary adaptation of plants to intertidal environments.
Title: Genome Assembly of Three Shrub Mangroves in the Genus Acanthus Reveals Two Polyploidy Events and Expansion of Genes Linked to Root Adaptation in Coastal Habitats. Journal: GigaScience.
Authors: Wanapinun Nawae, Chaiwat Naktang, Peeraphat Paenpong, Duangjai Sangsrakru, Thippawan Yoocha, Sonicha U-Thoomporn, Wasitthee Kongkachana, Poonsri Wanthongchai, Suchart Yamprasai, Chonlawit Samart, Sithichoke Tangphatsornruang, Wirulda Pootakham
Pub Date: 2026-01-02; DOI: 10.1093/gigascience/giaf162
Pub Date: 2025-12-26; DOI: 10.1093/gigascience/giaf159
Maria J P Sousa, Mari Toppinen, Lari Pyöriä, Klaus Hedman, Antti Sajantila, Maria F Perdomo, Diogo Pratas
Background: The increasing availability of viral sequencing data has led to the emergence of many optimized viral genome reconstruction tools. Because the number of new tools is steadily increasing, it is difficult to identify functional and optimized tools that balance accuracy against computational resources, and to compare the features that each tool provides.
Results: In this paper, we surveyed open-source computational tools (including pipelines) used for human viral genome reconstruction, identifying specific characteristics, features, similarities, and dissimilarities between these tools. For quantitative comparison, we created an open-source reconstruction benchmark based on viral data. The benchmark was executed using both synthetic and real datasets. With the former, we evaluated the effects on the reconstruction process of using different human DNA viruses with simulated mutation rates, contamination and mitochondrial DNA inclusion, and various coverage depths. Each reconstruction program was also evaluated using real datasets, demonstrating its performance in real-life scenarios. The evaluation measures include the identity, a Normalized Compression Semi-Distance, and the Normalized Relative Compression between the genomes before and after reconstruction, as well as metrics regarding the length of the reconstructed genomes and the computational time and resources spent by each tool.
Conclusions: We provide a fully reproducible benchmark capable of evaluating currently available reconstruction programs. The benchmark is open-source and freely available at https://github.com/viromelab/HVRS. Additionally, based on the knowledge obtained from the systematic review and the benchmark, we provide some program recommendations for different reconstruction scenarios.
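Compression-based measures compare two sequences by how much one helps to compress the other. The sketch below uses the classic Normalized Compression Distance with zlib as a general-purpose stand-in; it is not the Normalized Compression Semi-Distance or Normalized Relative Compression used in the benchmark, whose definitions and compressors differ. The sequences are randomly generated toy data.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (maximum compression level)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Classic Normalized Compression Distance: near 0 for similar inputs, near 1 for unrelated ones."""
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(2000))

# A "reconstruction" that differs from the reference by one substitution every 200 bases.
bases = list(reference)
for pos in range(0, len(bases), 200):
    bases[pos] = "A" if bases[pos] != "A" else "C"
reconstructed = "".join(bases)

# An unrelated random sequence of the same length.
unrelated = "".join(rng.choice("ACGT") for _ in range(2000))

d_close = ncd(reference.encode(), reconstructed.encode())
d_far = ncd(reference.encode(), unrelated.encode())
print("NCD(reference, reconstructed):", round(d_close, 3))
print("NCD(reference, unrelated)    :", round(d_far, 3))
```

A general-purpose compressor such as zlib only sees repeats within its 32 KB window, so for full-length genomes a dedicated DNA compressor would be a more appropriate choice.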
Title: An evaluation of computational methods for reconstruction of human viral DNA genomes. Journal: GigaScience.
Pub Date: 2025-12-22; DOI: 10.1093/gigascience/giaf160
Sarah L F Martin, Renato La Torre, Bram Danneels, Ave Tooming-Klunderud, Morten Skage, Spyridon Kollias, Ole Kristian Tørresen, Mohsen Falahati Anbaran, Elisabeth Stur, Kjetill S Jakobsen, Michael D Martin, Torbjørn Ekrem
Background: Arctic and alpine insects experience extreme environmental stressors, yet the genomic basis of their adaptation is poorly understood. Diamesa midges (Diptera: Chironomidae) are cold-adapted insects inhabiting glacial and high-altitude freshwater ecosystems, but no chromosome-level genomes have been available to date.
Findings: We present the first haplotype-resolved, chromosome-level genomes for four Diamesa species (D. hyperborea, D. lindrothi, D. serratosioi and D. tonsa), assembled using PacBio HiFi sequencing and Hi-C scaffolding. The assemblies show high completeness and k-mer representation. Phylogenomic analyses place Diamesinae as sister to other Chironomidae except Podonominae, and comparisons suggest introgression between the distinct species D. hyperborea and D. tonsa. Comparative genomic analyses across 20 Diptera species identified significant gene family contractions in Diamesa related to oxygen transport and metabolism, consistent with adaptation to high-altitude, low-oxygen environments. Expansions were observed in histone-related and Toll-like receptor gene families, suggesting roles in chromatin remodeling and immune regulation under cold stress. A glucose dehydrogenase gene family was significantly expanded across all cold-adapted species studied, implicating it in cryoprotectant synthesis and oxidative stress mitigation. Diamesa exhibited the largest gene family contraction at any phylogenetic node, with limited overlap in expansions with other cold-adapted Diptera, indicating lineage-specific adaptation.
Conclusions: Our findings support the hypothesis that genome size condensation and selective gene family changes underpin survival in cold environments. These new genome assemblies provide a valuable resource for studying adaptation, speciation, and conservation in cold-specialist insects. Future integration of gene expression and population genomics will further clarify the evolutionary resilience of Diamesa in a warming world.
Title: Haplotype-resolved chromosome-level genome assemblies of four Diamesa species reveal the genetic basis of cold tolerance and high-altitude adaptations in arctic chironomids. Journal: GigaScience. Publication date: 2025-12-22. DOI: https://doi.org/10.1093/gigascience/giaf160
Pub Date: 2025-12-19 DOI: 10.1093/gigascience/giaf157
Jorge Mas-Gómez, Manuel Rubio, Federico Dicenta, Pedro José Martínez-García
Background: High-throughput phenotyping is addressing the current bottleneck in phenotyping within breeding programs. Imaging tools are becoming the primary resource for improving the efficiency of phenotyping processes and providing large datasets for genomic selection approaches. The advent of AI brings new advantages by enhancing phenotyping methods using imaging, making them more accessible to breeding programs. In this context, we have developed an open Python workflow for analyzing morphology, colour and morphometric traits using AI, which can be applied to fruits and other plant organs.
Results: The workflow was implemented in almond (Prunus dulcis (Mill.) D. A. Webb), a species where breeding efficiency is critical due to its long breeding cycle. Over 25,000 kernels, more than 20,000 nuts, and over 600 individuals were phenotyped, making this the largest morphological study conducted in almond so far. The best segmentation and reconstruction approaches achieved error rates below 1%. Weight and area variables enabled accurate estimation of kernel thickness, with a root mean squared error (RMSE) of 0.47. Fifty-five heritable morphological, morphometric and colour traits were identified, highlighting their potential as target traits in breeding programs.
Conclusion: The proposed workflow demonstrated robust performance across diverse datasets and was effective with limited training data for fine-tuning. Its compatibility with the output of AI-based labelling tools allows users to fully leverage the advantages of these technologies: reducing manual effort, accelerating dataset preparation, and streamlining the fine-tuning process of segmentation models. This flexibility enhances the scalability and practical applicability of the workflow in real-world phenotyping scenarios, especially in the context of breeding programs.
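The kernel-thickness result above is reported as an RMSE, so a small sketch of how such an error could be computed may help. The abstract does not describe the regression model, so the one used here (thickness proportional to weight divided by projected area, fitted with a single least-squares coefficient) is an assumption made purely for illustration, as are the function names and the measurements, which are made up and not taken from the study.

```python
import math


def fit_scale(x, y):
    """Least-squares slope k for the no-intercept model y ~ k * x."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up illustrative measurements (NOT data from the study).
weights_g = [1.0, 1.2, 1.5, 0.9]
areas_mm2 = [250.0, 280.0, 310.0, 240.0]
thickness_mm = [6.1, 6.4, 7.3, 5.6]

# Assumed predictor: weight / area, a rough proxy for thickness
# if density is roughly constant across kernels.
ratio = [w / a for w, a in zip(weights_g, areas_mm2)]
k = fit_scale(ratio, thickness_mm)
predicted = [k * r for r in ratio]
print(f"k = {k:.1f}, RMSE = {rmse(thickness_mm, predicted):.3f}")
```

An RMSE is expressed in the units of the predicted variable, so a reported value is only interpretable alongside the unit and the typical range of the trait.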
Title: Open RGB Imaging Workflow for Morphological and Morphometric Analysis of Fruits using Deep Learning: A Case Study on Almonds. Journal: GigaScience. Publication date: 2025-12-19. DOI: https://doi.org/10.1093/gigascience/giaf157
Pub Date: 2025-12-18 DOI: 10.1093/gigascience/giaf158
Hangwei Liu, Lihong Lei, Fan Jiang, Bo Zhang, Hengchao Wang, Yutong Zhang, Hanbo Zhao, Guirong Wang, Wei Fan
Background: Praying mantises, members of the order Mantodea, play important roles in agriculture, medicine, bionics, and entertainment. However, the scarcity of genomic resources has hindered extensive studies on mantis evolution and behaviour.
Results: Here, we present the chromosome-scale reference genomes of five mantis species: the European mantis (Mantis religiosa), Chinese mantis (Tenodera sinensis), triangle dead leaf mantis (Deroplatys truncata), orchid mantis (Hymenopus coronatus), and metallic mantis (Metallyticus violacea). The assembled genome sizes range from ∼2.3 to 4.2 Gb, with contig N50 sizes of 1-109 Mb and 85-99% of the sequence anchored to chromosomes. The number of annotated protein-coding genes ranges from 17,804 to 19,017, with BUSCO completeness of 96.7-98.4%. We found that transposable element expansion is the major force governing genome size in Mantodea, and we suggest that translocations between the X chromosome and an autosome have occurred in the lineage of the family Mantidae. In addition, we found that the lineage of M. violacea has accumulated fewer substitutions than the lineages of other mantises. Furthermore, our genome-wide analyses showed that D. truncata is more closely related to H. coronatus than to M. religiosa and T. sinensis, which helps resolve the phylogenetic controversies surrounding the genus Deroplatys.
Conclusions: The high-quality genome assemblies of the five mantises provide a valuable resource for evolutionary studies of Mantodea and for the genetic improvement and breeding of beneficial biological control agents.
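The assembly statistics quoted above (contig N50 and the fraction of sequence anchored to chromosomes) follow standard definitions, which the sketch below spells out. The function names and the contig lengths are illustrative assumptions; this is not the authors' pipeline, and real assemblies would normally be summarised with an established assembly-statistics tool.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


def anchored_fraction(chromosome_lengths, unplaced_lengths):
    """Fraction of assembled bases placed on chromosomes."""
    anchored = sum(chromosome_lengths)
    return anchored / (anchored + sum(unplaced_lengths))


# Toy contig lengths (arbitrary units), not taken from the paper.
contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(contigs))
print(anchored_fraction([90, 60], [10, 5, 5]))
```

Because N50 depends only on the length distribution, it says nothing about correctness, which is why it is reported alongside completeness measures such as BUSCO.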
Title: The genomes of five mantises provide insights into sex chromosome evolution and Mantodea phylogeny clarification. Journal: GigaScience. Publication date: 2025-12-18. DOI: https://doi.org/10.1093/gigascience/giaf158
Pub Date: 2025-12-12 DOI: 10.1093/gigascience/giaf152
Sierra A T Moxon, Harold Solbrig, Nomi L Harris, Patrick Kalita, Mark A Miller, Sujay Patil, Kevin Schaper, Chris Bizon, J Harry Caufield, Silvano Cirujano Cuesta, Corey Cox, Frank Dekervel, Damion M Dooley, William D Duncan, Tim Fliss, Sarah Gehrke, Adam S L Graefe, Harshad Hegde, A J Ireland, Julius O B Jacobsen, Madan Krishnamurthy, Carlo Kroll, David Linke, Ryan Ly, Nicolas Matentzoglu, James A Overton, Jonny L Saunders, Deepak R Unni, Gaurav Vaidya, Wouter-Michiel A M Vierdag, Oliver Ruebel, Christopher G Chute, Matthew H Brush, Melissa A Haendel, Christopher J Mungall
Background: Scientific research relies on well-structured, standardized data; however, much of it is stored in formats such as free-text lab notebooks, non-standardized spreadsheets, or data repositories. This lack of structure challenges interoperability, making data integration, validation, and reuse difficult.
Findings: LinkML (Linked Data Modeling Language) is an open framework that simplifies the process of authoring, validating, and sharing data. LinkML can describe a range of data structures, from flat, list-based models to complex, interrelated, and normalized models that utilize polymorphism and compound inheritance. It offers an approachable syntax that is not tied to any one technical architecture and can be integrated seamlessly with many existing frameworks. The LinkML syntax provides a standard way to describe schemas, classes, and relationships, allowing modelers to build well-defined, stable, and optionally ontology-aligned data structures. Once defined, LinkML schemas may be imported into other LinkML schemas. These key features make LinkML an accessible platform for interdisciplinary collaboration and a reliable way to define and share data semantics.
Conclusions: LinkML helps reduce heterogeneity, complexity, and the proliferation of single-use data models while simultaneously enabling compliance with FAIR data standards. LinkML has seen increasing adoption in various fields, including biology, chemistry, biomedicine, microbiome research, finance, electrical engineering, transportation, and commercial software development. In short, LinkML makes implicit models explicitly computable and allows data to be standardized at its origin. LinkML documentation and code are available at linkml.io.
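To make the idea of schemas, classes, and inheritance more concrete, here is a toy sketch. LinkML schemas are normally authored in YAML and processed with the LinkML tooling; the version below instead writes a schema as a plain Python dictionary and resolves inherited attributes by hand, so it runs without any dependencies. The keys (`classes`, `is_a`, `attributes`, `range`, `required`, `identifier`) follow LinkML's schema vocabulary, but the schema id, the class names, and both helper functions are hypothetical and are not part of the LinkML API.

```python
# A minimal schema in LinkML's vocabulary, written as a Python dict.
# The id and the classes are hypothetical examples.
schema = {
    "id": "https://example.org/sample-schema",
    "name": "sample_schema",
    "prefixes": {"linkml": "https://w3id.org/linkml/"},
    "imports": ["linkml:types"],
    "default_range": "string",
    "classes": {
        "NamedThing": {
            "attributes": {
                "id": {"identifier": True, "required": True},
                "name": {},
            },
        },
        "Sample": {
            "is_a": "NamedThing",  # single inheritance from NamedThing
            "attributes": {
                "collection_date": {"range": "date"},
                "depth_m": {"range": "float"},
            },
        },
    },
}


def induced_attributes(schema, class_name):
    """Toy resolver: merge attributes along the is_a chain (parent first)."""
    cls = schema["classes"][class_name]
    attrs = {}
    parent = cls.get("is_a")
    if parent:
        attrs.update(induced_attributes(schema, parent))
    attrs.update(cls.get("attributes", {}))
    return attrs


def missing_required(schema, class_name, instance):
    """Toy check: names of required attributes absent from an instance."""
    attrs = induced_attributes(schema, class_name)
    return sorted(
        name for name, spec in attrs.items()
        if spec.get("required") and name not in instance
    )


print(sorted(induced_attributes(schema, "Sample")))
print(missing_required(schema, "Sample", {"name": "core 7"}))
```

In practice the same schema would live in a YAML file and be validated with LinkML's own validators and generators, which also handle ranges, enumerations, and ontology mappings that this sketch ignores.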
Title: LinkML: An Open Data Modeling Framework. Journal: GigaScience. Publication date: 2025-12-12. DOI: https://doi.org/10.1093/gigascience/giaf152