Pub Date: 2025-12-12 | DOI: 10.1093/gigascience/giaf152
LinkML: An Open Data Modeling Framework.
Sierra A T Moxon, Harold Solbrig, Nomi L Harris, Patrick Kalita, Mark A Miller, Sujay Patil, Kevin Schaper, Chris Bizon, J Harry Caufield, Silvano Cirujano Cuesta, Corey Cox, Frank Dekervel, Damion M Dooley, William D Duncan, Tim Fliss, Sarah Gehrke, Adam S L Graefe, Harshad Hegde, A J Ireland, Julius O B Jacobsen, Madan Krishnamurthy, Carlo Kroll, David Linke, Ryan Ly, Nicolas Matentzoglu, James A Overton, Jonny L Saunders, Deepak R Unni, Gaurav Vaidya, Wouter-Michiel A M Vierdag, Oliver Ruebel, Christopher G Chute, Matthew H Brush, Melissa A Haendel, Christopher J Mungall
Background: Scientific research relies on well-structured, standardized data; however, much of it is stored in unstructured or loosely structured forms such as free-text lab notebooks, non-standardized spreadsheets, or ad hoc data repositories. This lack of structure hinders interoperability, making data integration, validation, and reuse difficult.
Findings: LinkML (Linked Data Modeling Language) is an open framework that simplifies the process of authoring, validating, and sharing data. LinkML can describe a range of data structures, from flat, list-based models to complex, interrelated, and normalized models that utilize polymorphism and compound inheritance. It offers an approachable syntax that is not tied to any one technical architecture and can be integrated seamlessly with many existing frameworks. The LinkML syntax provides a standard way to describe schemas, classes, and relationships, allowing modelers to build well-defined, stable, and optionally ontology-aligned data structures. Once defined, LinkML schemas may be imported into other LinkML schemas. These key features make LinkML an accessible platform for interdisciplinary collaboration and a reliable way to define and share data semantics.
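To make the schema syntax concrete, below is a minimal sketch of a LinkML schema authored inline and inspected programmatically. The Person class, its attributes, and the schema URI are invented for this example, not drawn from the paper; the snippet assumes the linkml-runtime Python package (pip install linkml-runtime).

```python
# A minimal, hypothetical LinkML schema defined as a YAML string and inspected
# with linkml-runtime's SchemaView. The Person class and its attributes are
# invented for illustration only.
from linkml_runtime import SchemaView

SCHEMA_YAML = """
id: https://example.org/person-schema
name: person_schema
prefixes:
  linkml: https://w3id.org/linkml/
imports:
  - linkml:types
default_range: string
classes:
  Person:
    description: A minimal example class.
    attributes:
      id:
        identifier: true
      full_name:
        required: true
      age:
        range: integer
"""

view = SchemaView(SCHEMA_YAML)             # parse the YAML into a schema object
print(sorted(view.class_slots("Person")))  # -> ['age', 'full_name', 'id']
```

The same schema text can then be fed to LinkML's generators to emit, for example, JSON Schema, SQL tables, or Python dataclasses, which is one sense in which the framework is not tied to any single technical architecture.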
Conclusions: LinkML helps reduce heterogeneity, complexity, and the proliferation of single-use data models while simultaneously enabling compliance with FAIR data standards. LinkML has seen increasing adoption in various fields, including biology, chemistry, biomedicine, microbiome research, finance, electrical engineering, transportation, and commercial software development. In short, LinkML makes implicit models explicitly computable and allows data to be standardized at its origin. LinkML documentation and code are available at linkml.io.
{"title":"LinkML: An Open Data Modeling Framework.","authors":"Sierra A T Moxon, Harold Solbrig, Nomi L Harris, Patrick Kalita, Mark A Miller, Sujay Patil, Kevin Schaper, Chris Bizon, J Harry Caufield, Silvano Cirujano Cuesta, Corey Cox, Frank Dekervel, Damion M Dooley, William D Duncan, Tim Fliss, Sarah Gehrke, Adam S L Graefe, Harshad Hegde, A J Ireland, Julius O B Jacobsen, Madan Krishnamurthy, Carlo Kroll, David Linke, Ryan Ly, Nicolas Matentzoglu, James A Overton, Jonny L Saunders, Deepak R Unni, Gaurav Vaidya, Wouter-Michiel A M Vierdag, Oliver Ruebel, Christopher G Chute, Matthew H Brush, Melissa A Haendel, Christopher J Mungall","doi":"10.1093/gigascience/giaf152","DOIUrl":"https://doi.org/10.1093/gigascience/giaf152","url":null,"abstract":"<p><strong>Background: </strong>Scientific research relies on well-structured, standardized data; however, much of it is stored in formats such as free-text lab notebooks, non-standardized spreadsheets, or data repositories. This lack of structure challenges interoperability, making data integration, validation, and reuse difficult.</p><p><strong>Findings: </strong>LinkML (Linked Data Modeling Language) is an open framework that simplifies the process of authoring, validating, and sharing data. LinkML can describe a range of data structures, from flat, list-based models to complex, interrelated, and normalized models that utilize polymorphism and compound inheritance. It offers an approachable syntax that is not tied to any one technical architecture and can be integrated seamlessly with many existing frameworks. The LinkML syntax provides a standard way to describe schemas, classes, and relationships, allowing modelers to build well-defined, stable, and optionally ontology-aligned data structures. Once defined, LinkML schemas may be imported into other LinkML schemas. These key features make LinkML an accessible platform for interdisciplinary collaboration and a reliable way to define and share data semantics.</p><p><strong>Conclusions: </strong>LinkML helps reduce heterogeneity, complexity, and the proliferation of single-use data models while simultaneously enabling compliance with FAIR data standards. LinkML has seen increasing adoption in various fields, including biology, chemistry, biomedicine, microbiome research, finance, electrical engineering, transportation, and commercial software development. In short, LinkML makes implicit models explicitly computable and allows data to be standardized at its origin. LinkML documentation and code are available at linkml.io.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":" ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145742108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-06 | DOI: 10.1093/gigascience/giae098
An analysis of performance bottlenecks in MRI preprocessing.
Mathieu Dugré, Yohan Chatelain, Tristan Glatard
Magnetic resonance imaging (MRI) preprocessing is a critical step for neuroimaging analysis. However, the computational cost of MRI preprocessing pipelines is a major bottleneck for large cohort studies and some clinical applications. While high-performance computing and, more recently, deep learning have been adopted to accelerate the computations, these techniques require costly hardware that is not accessible to all researchers. It is therefore important to understand the performance bottlenecks of MRI preprocessing pipelines in order to improve their performance. Using the Intel VTune profiler, we characterized the bottlenecks of several commonly used MRI preprocessing pipelines from the Advanced Normalization Tools (ANTs), FMRIB Software Library, and FreeSurfer toolboxes. We found that a few functions contributed most of the CPU time, with linear interpolation the largest contributor; data access was also a substantial bottleneck. We identified a bug in the Insight Segmentation and Registration Toolkit library that impacts the performance of the ANTs pipeline in single precision, as well as a potential issue with OpenMP scaling in FreeSurfer recon-all. Our results provide a reference for future efforts to optimize MRI preprocessing pipelines.
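As a rough illustration of the method (hotspot profiling in general, not the authors' actual VTune workflow), the sketch below profiles a toy pipeline in which repeated linear interpolation dominates and prints the top functions by cumulative CPU time; all names and sizes are invented.

```python
# Illustrative only: a Python-level analogue of hotspot profiling. cProfile
# attributes cumulative CPU time to functions, which is how dominant
# contributors (e.g., interpolation) surface in a profile.
import cProfile
import pstats
import numpy as np

def linear_interpolate(signal, xs):
    """Toy 1-D linear interpolation (stand-in for a pipeline's resampling)."""
    i = np.clip(xs.astype(int), 0, len(signal) - 2)
    frac = xs - i
    return (1 - frac) * signal[i] + frac * signal[i + 1]

def toy_pipeline():
    rng = np.random.default_rng(0)
    signal = rng.standard_normal(1_000_000)
    for _ in range(20):  # repeated resampling dominates the runtime
        xs = rng.uniform(0, len(signal) - 1, 1_000_000)
        linear_interpolate(signal, xs)

profiler = cProfile.Profile()
profiler.enable()
toy_pipeline()
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)  # top 5 hotspots
```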
{"title":"An analysis of performance bottlenecks in MRI preprocessing.","authors":"Mathieu Dugré, Yohan Chatelain, Tristan Glatard","doi":"10.1093/gigascience/giae098","DOIUrl":"10.1093/gigascience/giae098","url":null,"abstract":"<p><p>Magnetic resonance imaging (MRI) preprocessing is a critical step for neuroimaging analysis. However, the computational cost of MRI preprocessing pipelines is a major bottleneck for large cohort studies and some clinical applications. While high-performance computing and, more recently, deep learning have been adopted to accelerate the computations, these techniques require costly hardware and are not accessible to all researchers. Therefore, it is important to understand the performance bottlenecks of MRI preprocessing pipelines to improve their performance. Using the Intel VTune profiler, we characterized the bottlenecks of several commonly used MRI preprocessing pipelines from the Advanced Normalization Tools (ANTs), FMRIB Software Library, and FreeSurfer toolboxes. We found few functions contributed to most of the CPU time and that linear interpolation was the largest contributor. Data access was also a substantial bottleneck. We identified a bug in the Insight Segmentation and Registration Toolkit library that impacts the performance of the ANTs pipeline in single precision and a potential issue with the OpenMP scaling in FreeSurfer recon-all. Our results provide a reference for future efforts to optimize MRI preprocessing pipelines.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"14 ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11899568/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143614576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-06 | DOI: 10.1093/gigascience/giae110
Similar, but not the same: multiomics comparison of human valve interstitial cells and osteoblast osteogenic differentiation expanded with an estimation of data-dependent and data-independent PASEF proteomics.
Arseniy Lobov, Polina Kuchur, Nadezhda Boyarskaya, Daria Perepletchikova, Ivan Taraskin, Andrei Ivashkin, Daria Kostina, Irina Khvorova, Vladimir Uspensky, Egor Repkin, Evgeny Denisov, Tatiana Gerashchenko, Rashid Tikhilov, Svetlana Bozhkova, Vitaly Karelkin, Chunli Wang, Kang Xu, Anna Malashicheva
Osteogenic differentiation is crucial in normal bone formation and in pathological calcification such as calcific aortic valve disease (CAVD). Understanding the proteomic and transcriptomic landscapes underlying this differentiation can unveil potential therapeutic targets for CAVD. In this study, we employed RNA sequencing transcriptomics and proteomics on a timsTOF Pro platform to explore the multiomics profiles of valve interstitial cells (VICs) and osteoblasts during osteogenic differentiation. For proteomics, we utilized 3 data acquisition/analysis techniques: data-dependent acquisition (DDA)-parallel accumulation serial fragmentation (PASEF) and data-independent acquisition (DIA)-PASEF, the latter with both a classic library-based search (DIA) and a machine learning-based library-free search (DIA-ML). Using RNA sequencing data as a biological reference, we compared these 3 analytical techniques in the context of an actual biological experiment. We used this comprehensive dataset to reveal distinct proteomic and transcriptomic profiles between VICs and osteoblasts, highlighting specific biological processes in their osteogenic differentiation pathways. The study identified potential therapeutic targets specific to VIC osteogenic differentiation in CAVD, including MAOA and the ERK1/2 pathway. From a technical perspective, we found that DIA-based methods offer an even greater advantage over DDA for sophisticated human primary cell cultures than previously shown on HeLa samples. While the classic library-based DIA approach has proved to be a gold standard for shotgun proteomics research, DIA-ML offers significant advantages with a relatively minor compromise in data reliability, making it the method of choice for routine proteomics.
{"title":"Similar, but not the same: multiomics comparison of human valve interstitial cells and osteoblast osteogenic differentiation expanded with an estimation of data-dependent and data-independent PASEF proteomics.","authors":"Arseniy Lobov, Polina Kuchur, Nadezhda Boyarskaya, Daria Perepletchikova, Ivan Taraskin, Andrei Ivashkin, Daria Kostina, Irina Khvorova, Vladimir Uspensky, Egor Repkin, Evgeny Denisov, Tatiana Gerashchenko, Rashid Tikhilov, Svetlana Bozhkova, Vitaly Karelkin, Chunli Wang, Kang Xu, Anna Malashicheva","doi":"10.1093/gigascience/giae110","DOIUrl":"10.1093/gigascience/giae110","url":null,"abstract":"<p><p>Osteogenic differentiation is crucial in normal bone formation and pathological calcification, such as calcific aortic valve disease (CAVD). Understanding the proteomic and transcriptomic landscapes underlying this differentiation can unveil potential therapeutic targets for CAVD. In this study, we employed RNA sequencing transcriptomics and proteomics on a timsTOF Pro platform to explore the multiomics profiles of valve interstitial cells (VICs) and osteoblasts during osteogenic differentiation. For proteomics, we utilized 3 data acquisition/analysis techniques: data-dependent acquisition (DDA)-parallel accumulation serial fragmentation (PASEF) and data-independent acquisition (DIA)-PASEF with a classic library-based (DIA) and machine learning-based library-free search (DIA-ML). Using RNA sequencing data as a biological reference, we compared these 3 analytical techniques in the context of actual biological experiments. We use this comprehensive dataset to reveal distinct proteomic and transcriptomic profiles between VICs and osteoblasts, highlighting specific biological processes in their osteogenic differentiation pathways. The study identified potential therapeutic targets specific for VICs osteogenic differentiation in CAVD, including the MAOA and ERK1/2 pathway. From a technical perspective, we found that DIA-based methods demonstrate even higher superiority against DDA for more sophisticated human primary cell cultures than it was shown before on HeLa samples. While the classic library-based DIA approach has proved to be a gold standard for shotgun proteomics research, the DIA-ML offers significant advantages with a relatively minor compromise in data reliability, making it the method of choice for routine proteomics.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"14 ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11724719/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143055932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-06 | DOI: 10.1093/gigascience/giaf016
How to select predictive models for decision-making or causal inference.
Matthieu Doutreligne, Gaël Varoquaux
Background: We investigate which procedure selects the most trustworthy predictive model to explain the effect of an intervention and support decision-making.
Methods: We study a large variety of model selection procedures in practical settings: finite-sample settings, without the theoretical assumption of well-specified models. Beyond standard cross-validation and internal validation procedures, we also study elaborate causal risks, which build proxies of the causal error using "nuisance" reweighting so that it can be computed on the observed data. We evaluate whether empirically estimated nuisances, which are necessarily noisy, add noise to model selection, and we compare different metrics for causal model selection in an extensive empirical study based on a simulation and 3 health care datasets built on real covariates.
Results: Among all metrics, the mean squared error, classically used to evaluate predictive models, performs worst. Reweighting it with a propensity score does not bring much improvement in most cases. On average, the $R\text{-risk}$, which uses a model of the mean outcome and propensity scores as nuisances, leads to the best performance. Nuisance corrections are best estimated with flexible estimators such as a super learner.
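For context, the $R\text{-risk}$ referred to here is, in its standard form from the causal inference literature (our notation, not reproduced from the paper), a residual-on-residual squared error computed with the two nuisance models:

```latex
% R-risk of a candidate causal-effect model \tau, with nuisance estimates
% \hat{m}(x) for the mean outcome E[Y | X = x] and \hat{e}(x) for the
% propensity score P(A = 1 | X = x). Standard formulation; our notation.
\widehat{R}_{R}(\tau)
  \;=\; \frac{1}{n} \sum_{i=1}^{n}
  \Bigl[ \bigl( y_i - \hat{m}(x_i) \bigr)
         \;-\; \bigl( a_i - \hat{e}(x_i) \bigr)\, \tau(x_i) \Bigr]^{2}
```

Noise in $\hat{m}$ and $\hat{e}$ propagates into this estimate, which is exactly why the study asks whether empirically estimated nuisances degrade model selection.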
Conclusions: When predictive models are used to explain the effect of an intervention, they must be evaluated with different procedures than in standard predictive settings, using the $R\text{-risk}$ from causal inference.
{"title":"How to select predictive models for decision-making or causal inference.","authors":"Matthieu Doutreligne, Gaël Varoquaux","doi":"10.1093/gigascience/giaf016","DOIUrl":"10.1093/gigascience/giaf016","url":null,"abstract":"<p><strong>Background: </strong>We investigate which procedure selects the most trustworthy predictive model to explain the effect of an intervention and support decision-making.</p><p><strong>Methods: </strong>We study a large variety of model selection procedures in practical settings: finite samples settings and without a theoretical assumption of well-specified models. Beyond standard cross-validation or internal validation procedures, we also study elaborate causal risks. These build proxies of the causal error using \"nuisance\" reweighting to compute it on the observed data. We evaluate whether empirically estimated nuisances, which are necessarily noisy, add noise to model selection and compare different metrics for causal model selection in an extensive empirical study based on a simulation and 3 health care datasets based on real covariates.</p><p><strong>Results: </strong>Among all metrics, the mean squared error, classically used to evaluate predictive modes, is worse. Reweighting it with a propensity score does not bring much improvement in most cases. On average, the $Rtext{-risk}$, which uses as nuisances a model of mean outcome and propensity scores, leads to the best performances. Nuisance corrections are best estimated with flexible estimators such as a super learner.</p><p><strong>Conclusions: </strong>When predictive models are used to explain the effect of an intervention, they must be evaluated with different procedures than standard predictive settings, using the $Rtext{-risk}$ from causal inference.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"14 ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11927402/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143673822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-06 | DOI: 10.1093/gigascience/giaf044
The haplotype-resolved T2T genome for Bauhinia × blakeana sheds light on the genetic basis of flower heterosis.
Weixue Mu, Joshua Casey Darian, Wing-Kin Sung, Xing Guo, Tuo Yang, Mandy Wai Man Tang, Ziqiang Chen, Steve Kwan Hok Tong, Irene Wing Shan Chik, Robert L Davidson, Scott C Edmunds, Tong Wei, Stephen Kwok-Wing Tsui
Background: The Hong Kong orchid tree Bauhinia × blakeana Dunn has long been proposed to be a sterile interspecific hybrid exhibiting flower heterosis when compared to its likely parental species, Bauhinia purpurea L. and Bauhinia variegata L. Here, we report comparative genomic and transcriptomic analyses of the 3 Bauhinia species.
Findings: We generated chromosome-level assemblies for the parental species and applied a trio-binning approach to construct a haplotype-resolved telomere-to-telomere (T2T) genome for B. blakeana. Comparative chloroplast genome analysis confirmed B. purpurea as the maternal parent. Transcriptome profiling of flower tissues highlighted a closer resemblance of B. blakeana to its maternal parent. Differential gene expression analyses revealed distinct expression patterns among the 3 species, particularly in biosynthetic and metabolic processes. To investigate the genetic basis of flower heterosis observed in B. blakeana, we focused on gene expression patterns within pigment biosynthesis-related pathways. High-parent dominance and overdominance expression patterns were observed, particularly in genes associated with carotenoid biosynthesis. Additionally, allele-specific expression analysis revealed a balanced contribution of maternal and paternal alleles in shaping the gene expression patterns in B. blakeana.
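For readers unfamiliar with trio binning, the conceptual core (sketched below with toy data; this is not the authors' pipeline) is to assign each read of the hybrid offspring to a haplotype bin by counting k-mers that occur in only one parent's reads.

```python
# Conceptual sketch of trio binning: assign a child read to a haplotype bin by
# counting parent-specific k-mers. Toy sequences; real pipelines derive the
# parent-specific sets from parental sequencing reads (shared k-mers removed).
def kmers(seq: str, k: int = 21) -> set:
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def bin_read(read: str, maternal_only: set, paternal_only: set, k: int = 21) -> str:
    ks = kmers(read, k)
    m, p = len(ks & maternal_only), len(ks & paternal_only)
    if m > p:
        return "maternal"
    if p > m:
        return "paternal"
    return "unassigned"

mat = kmers("ACGTACGTACGTACGTACGTACGTACGTACGT")
pat = kmers("TTGCATTGCATTGCATTGCATTGCATTGCATT")
mat, pat = mat - pat, pat - mat          # keep only parent-specific k-mers
print(bin_read("ACGTACGTACGTACGTACGTACGTA", mat, pat))  # -> maternal
```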
Conclusions: Our study offers valuable insights into the genome architecture of hybrid B. blakeana, establishing a comprehensive genomic and transcriptomic resource for future functional genetics research within the Bauhinia genus. It also serves as a model for exploring the characteristics of hybrid species using T2T haplotype-resolved genomes, providing a novel approach to understanding genetic interactions and evolutionary mechanisms in complex genomes with high heterozygosity.
{"title":"The haplotype-resolved T2T genome for Bauhinia × blakeana sheds light on the genetic basis of flower heterosis.","authors":"Weixue Mu, Joshua Casey Darian, Wing-Kin Sung, Xing Guo, Tuo Yang, Mandy Wai Man Tang, Ziqiang Chen, Steve Kwan Hok Tong, Irene Wing Shan Chik, Robert L Davidson, Scott C Edmunds, Tong Wei, Stephen Kwok-Wing Tsui","doi":"10.1093/gigascience/giaf044","DOIUrl":"https://doi.org/10.1093/gigascience/giaf044","url":null,"abstract":"<p><strong>Background: </strong>The Hong Kong orchid tree Bauhinia × blakeana Dunn has long been proposed to be a sterile interspecific hybrid exhibiting flower heterosis when compared to its likely parental species, Bauhinia purpurea L. and Bauhinia variegata L. Here, we report comparative genomic and transcriptomic analyses of the 3 Bauhinia species.</p><p><strong>Findings: </strong>We generated chromosome-level assemblies for the parental species and applied a trio-binning approach to construct a haplotype-resolved telomere-to-telomere (T2T) genome for B. blakeana. Comparative chloroplast genome analysis confirmed B. purpurea as the maternal parent. Transcriptome profiling of flower tissues highlighted a closer resemblance of B. blakeana to its maternal parent. Differential gene expression analyses revealed distinct expression patterns among the 3 species, particularly in biosynthetic and metabolic processes. To investigate the genetic basis of flower heterosis observed in B. blakeana, we focused on gene expression patterns within pigment biosynthesis-related pathways. High-parent dominance and overdominance expression patterns were observed, particularly in genes associated with carotenoid biosynthesis. Additionally, allele-specific expression analysis revealed a balanced contribution of maternal and paternal alleles in shaping the gene expression patterns in B. blakeana.</p><p><strong>Conclusions: </strong>Our study offers valuable insights into the genome architecture of hybrid B. blakeana, establishing a comprehensive genomic and transcriptomic resource for future functional genetics research within the Bauhinia genus. It also serves as a model for exploring the characteristics of hybrid species using T2T haplotype-resolved genomes, providing a novel approach to understanding genetic interactions and evolutionary mechanisms in complex genomes with high heterozygosity.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"14 ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12012898/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143964846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-06 | DOI: 10.1093/gigascience/giaf038
Overture: an open-source genomics data platform.
Mitchell Shiell, Rosi Bajari, Dusan Andric, Jon Eubank, Brandon F Chan, Anders J Richardsson, Azher Ali, Bashar Allabadi, Yelizar Alturmessov, Jared Baker, Ann Catton, Kim Cullion, Daniel DeMaria, Patrick Dos Santos, Henrich Feher, Francois Gerthoffert, Minh Ha, Robin A Haw, Atul Kachru, Alexandru Lepsa, Alexis Li, Rakesh N Mistry, Hardeep K Nahal-Bose, Aleksandra Pejovic, Samantha Rich, Leonardo Rivera, Ciarán Schütte, Edmund Su, Robert Tisma, Jaser Uddin, Chang Wang, Alex N Wilmer, Linda Xiang, Junjun Zhang, Lincoln D Stein, Vincent Ferretti, Mélanie Courtot, Christina K Yung
Background: Next-generation sequencing has created many new technological challenges in organizing and distributing genomics datasets, which now can routinely reach petabyte scales. Coupled with data-hungry artificial intelligence and machine learning applications, findable, accessible, interoperable, and reusable genomics datasets have never been more valuable. While major archives like the Genomic Data Commons, Sequence Read Archive, and European Genome-Phenome Archive have improved researchers' ability to share and reuse data, and general-purpose repositories such as Zenodo and Figshare provide valuable platforms for research data publication, the diversity of genomics research precludes any one-size-fits-all approach. In many cases, bespoke solutions are required, and despite funding agencies and journals increasingly mandating reusable data practices, researchers still lack the technical support needed to meet the multifaceted challenges of data reuse.
Findings: Overture bridges this gap by providing open-source software for building and deploying customizable genomics data platforms. Its architecture consists of modular microservices, each of which is generalized with narrow responsibilities that together combine to create complete data management systems. These systems enable researchers to organize, share, and explore their genomics data at any scale. Through Overture, researchers can connect their data to both humans and machines, fostering reproducibility and enabling new insights through controlled data sharing and reuse.
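To illustrate what connecting data to machines can look like in practice, the sketch below pages through a hypothetical metadata endpoint of such a platform. The base URL, route, parameters, and field names are placeholders invented for this example; they are not Overture's actual API (see www.overture.bio for the real interfaces).

```python
# Hypothetical client sketch: paging through a metadata service in a
# microservice-based data platform. Endpoint paths, parameters, and field
# names are placeholders, NOT Overture's actual API.
import requests

BASE_URL = "https://example.org/api"  # placeholder deployment URL

def fetch_all_analyses(study_id: str, page_size: int = 100):
    """Yield analysis records one page at a time until the service runs dry."""
    offset = 0
    while True:
        resp = requests.get(
            f"{BASE_URL}/studies/{study_id}/analyses",
            params={"limit": page_size, "offset": offset},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json()        # assumed to be a JSON array of records
        if not batch:
            return
        yield from batch
        offset += page_size

for analysis in fetch_all_analyses("EXAMPLE-STUDY"):
    print(analysis.get("analysisId"))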
Conclusions: By making these tools freely available, we can accelerate the development of reliable genomic data management across the research community, flexibly and at multiple scales. Overture is an open-source project licensed under AGPLv3.0, with all source code publicly available from https://github.com/overture-stack and documentation on development, deployment, and usage available from www.overture.bio.
{"title":"Overture: an open-source genomics data platform.","authors":"Mitchell Shiell, Rosi Bajari, Dusan Andric, Jon Eubank, Brandon F Chan, Anders J Richardsson, Azher Ali, Bashar Allabadi, Yelizar Alturmessov, Jared Baker, Ann Catton, Kim Cullion, Daniel DeMaria, Patrick Dos Santos, Henrich Feher, Francois Gerthoffert, Minh Ha, Robin A Haw, Atul Kachru, Alexandru Lepsa, Alexis Li, Rakesh N Mistry, Hardeep K Nahal-Bose, Aleksandra Pejovic, Samantha Rich, Leonardo Rivera, Ciarán Schütte, Edmund Su, Robert Tisma, Jaser Uddin, Chang Wang, Alex N Wilmer, Linda Xiang, Junjun Zhang, Lincoln D Stein, Vincent Ferretti, Mélanie Courtot, Christina K Yung","doi":"10.1093/gigascience/giaf038","DOIUrl":"https://doi.org/10.1093/gigascience/giaf038","url":null,"abstract":"<p><strong>Background: </strong>Next-generation sequencing has created many new technological challenges in organizing and distributing genomics datasets, which now can routinely reach petabyte scales. Coupled with data-hungry artificial intelligence and machine learning applications, findable, accessible, interoperable, and reusable genomics datasets have never been more valuable. While major archives like the Genomics Data Commons, Sequence Reads Archive, and European Genome-Phenome Archive have improved researchers' ability to share and reuse data, and general-purpose repositories such as Zenodo and Figshare provide valuable platforms for research data publication, the diversity of genomics research precludes any one-size-fits-all approach. In many cases, bespoke solutions are required, and despite funding agencies and journals increasingly mandating reusable data practices, researchers still lack the technical support needed to meet the multifaceted challenges of data reuse.</p><p><strong>Findings: </strong>Overture bridges this gap by providing open-source software for building and deploying customizable genomics data platforms. Its architecture consists of modular microservices, each of which is generalized with narrow responsibilities that together combine to create complete data management systems. These systems enable researchers to organize, share, and explore their genomics data at any scale. Through Overture, researchers can connect their data to both humans and machines, fostering reproducibility and enabling new insights through controlled data sharing and reuse.</p><p><strong>Conclusions: </strong>By making these tools freely available, we can accelerate the development of reliable genomic data management across the research community quickly, flexibly, and at multiple scales. Overture is an open-source project licensed under AGPLv3.0 with all source code publicly available from https://github.com/overture-stack and documentation on development, deployment, and usage available from www.overture.bio.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"14 ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12020472/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143996787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-06 | DOI: 10.1093/gigascience/giaf024
Telomere-to-telomere genome assembly of Electrophorus electricus provides insights into the evolution of electric eels.
Zan Qi, Qun Liu, Haorong Li, Yaolei Zhang, Ziwei Yu, Wenkai Luo, Kun Wang, Yuxin Zhang, Shoupeng Pan, Chao Wang, Hui Jiang, Qiang Qiu, Wen Wang, Guangyi Fan, Yongxin Li
Background: Electric eels evolved remarkable electric organs that enable them to instantaneously discharge hundreds of volts for predation, defense, and communication. However, the absence of a high-quality reference genome has severely constrained studies of electric eels in many respects.
Results: Using high-depth, multiplatform sequencing data, we assembled the first telomere-to-telomere, high-quality reference genome of Electrophorus electricus, with a genome size of 833.43 Mb across 26 chromosomes. Multiple evaluations, including the N50 statistic (30.38 Mb), BUSCO score (97.30%), and mapping ratio of short-insert sequencing data (99.91%), demonstrate the high contiguity and completeness of the assembly. Genome annotation predicted 396.63 Mb of repetitive sequences and 20,992 protein-coding genes. Evolutionary analyses indicate that Gymnotiformes, the order to which the electric eel belongs, is more closely related to Characiformes than to Siluriformes and diverged from Characiformes 95.00 million years ago. Pairwise sequentially Markovian coalescent analysis revealed a sharp decline in the population size of E. electricus over the past few hundred thousand years. In addition, many regulatory factors related to neurotransmitters and classical signaling pathways during embryonic development were significantly expanded, potentially contributing to the generation of high-voltage electricity.
Conclusions: This study not only provided the first high-quality telomere-to-telomere reference genome of E. electricus but also greatly enhanced our understanding of electric eels.
{"title":"Telomere-to-telomere genome assembly of Electrophorus electricus provides insights into the evolution of electric eels.","authors":"Zan Qi, Qun Liu, Haorong Li, Yaolei Zhang, Ziwei Yu, Wenkai Luo, Kun Wang, Yuxin Zhang, Shoupeng Pan, Chao Wang, Hui Jiang, Qiang Qiu, Wen Wang, Guangyi Fan, Yongxin Li","doi":"10.1093/gigascience/giaf024","DOIUrl":"10.1093/gigascience/giaf024","url":null,"abstract":"<p><strong>Background: </strong>Electric eels evolved remarkable electric organs that enable them to instantaneously discharge hundreds of volts for predation, defense, and communication. However, the absence of a high-quality reference genome has extremely constrained the studies of electric eels in various aspects.</p><p><strong>Results: </strong>Using high-depth, multiplatform sequencing data, we successfully assembled the first telomere-to-telomere high-quality reference genome of Electrophorus electricus, which has a genome size of 833.43 Mb and comprises 26 chromosomes. Multiple evaluations, including N50 statistics (30.38 Mb), BUSCO scores (97.30%), and mapping ratio of short-insert sequencing data (99.91%), demonstrate the high contiguity and completeness of the electric eel genome assembly we obtained. Genome annotation predicted 396.63 Mb repetitive sequences and 20,992 protein-coding genes. Furthermore, evolutionary analyses indicate that Gymnotiformes, which the electric eel belongs to, has a closer relationship with Characiformes than Siluriformes and diverged from Characiformes 95.00 million years ago. Pairwise sequentially Markovian coalescent analysis found a sharply decreased trend of the population size of E. electricus over the past few hundred thousand years. Furthermore, many regulatory factors related to neurotransmitters and classical signaling pathways during embryonic development were significantly expanded, potentially contributing to the generation of high-voltage electricity.</p><p><strong>Conclusions: </strong>This study not only provided the first high-quality telomere-to-telomere reference genome of E. electricus but also greatly enhanced our understanding of electric eels.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"14 ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11959694/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143752095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-06 | DOI: 10.1093/gigascience/giae101
New implementation of data standards for AI in oncology: Experience from the EuCanImage project.
Teresa García-Lezana, Maciej Bobowicz, Santiago Frid, Michael Rutherford, Mikel Recuero, Katrine Riklund, Aldar Cabrelles, Marlena Rygusik, Lauren Fromont, Roberto Francischello, Emanuele Neri, Salvador Capella, Arcadi Navarro, Fred Prior, Jonathan Bona, Pilar Nicolas, Martijn P A Starmans, Karim Lekadir, Jordi Rambla
Background: An unprecedented amount of personal health data, with the potential to revolutionize precision medicine, is generated at health care institutions worldwide. The exploitation of such data using artificial intelligence (AI) relies on the ability to combine heterogeneous, multicentric, multimodal, and multiparametric data, as well as thoughtful representation of knowledge and data availability. Despite these possibilities, significant methodological challenges and ethicolegal constraints still impede the real-world implementation of data models.
Technical details: EuCanImage is an international consortium aimed at developing AI algorithms for precision medicine in oncology and enabling secondary use of the data under the necessary ethical approvals. The use of well-defined clinical data standards to ensure interoperability was a central element of the initiative. The consortium focuses on 3 cancer types and addresses 7 unmet clinical needs. We have conceived and implemented an innovative process to capture clinical data from hospitals, transform it into the newly developed EuCanImage data models, and store the standardized data in permanent repositories. This workflow combines recognized software (REDCap for data capture), data standards (FHIR for data structuring), and an existing repository (EGA for permanent data storage and sharing) with newly developed custom tools for data transformation and quality control (an ETL pipeline and QC scripts) to fill the gaps.
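As a small illustration of the kind of ETL transformation described (a sketch under assumed field names, not the project's actual pipeline), the snippet below maps one flat REDCap-style export row onto a FHIR R4 Patient resource. The REDCap variable names and coded values are hypothetical; resourceType, gender, and birthDate are standard FHIR Patient elements.

```python
# Illustrative ETL step in the spirit of the described workflow: mapping a
# flat REDCap-style record to a FHIR R4 Patient resource. REDCap field names
# and codes below are hypothetical.
from typing import Any, Dict

SEX_MAP = {"1": "male", "2": "female"}  # hypothetical REDCap coded values

def redcap_to_fhir_patient(record: Dict[str, str]) -> Dict[str, Any]:
    """Transform one flat REDCap export row into a FHIR Patient resource."""
    return {
        "resourceType": "Patient",
        "id": record["record_id"],
        "gender": SEX_MAP.get(record.get("sex", ""), "unknown"),
        "birthDate": record.get("dob"),  # assumed to arrive as YYYY-MM-DD
    }

row = {"record_id": "p001", "sex": "2", "dob": "1980-04-02"}
print(redcap_to_fhir_patient(row))
```

In a real deployment, the QC scripts mentioned above would validate such resources against the target profile before they are deposited in the permanent repository.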
Conclusion: This article synthesizes our experience and procedures for health care data interoperability, standardization, and reproducibility.
{"title":"New implementation of data standards for AI in oncology: Experience from the EuCanImage project.","authors":"Teresa García-Lezana, Maciej Bobowicz, Santiago Frid, Michael Rutherford, Mikel Recuero, Katrine Riklund, Aldar Cabrelles, Marlena Rygusik, Lauren Fromont, Roberto Francischello, Emanuele Neri, Salvador Capella, Arcadi Navarro, Fred Prior, Jonathan Bona, Pilar Nicolas, Martijn P A Starmans, Karim Lekadir, Jordi Rambla","doi":"10.1093/gigascience/giae101","DOIUrl":"10.1093/gigascience/giae101","url":null,"abstract":"<p><strong>Background: </strong>An unprecedented amount of personal health data, with the potential to revolutionize precision medicine, is generated at health care institutions worldwide. The exploitation of such data using artificial intelligence (AI) relies on the ability to combine heterogeneous, multicentric, multimodal, and multiparametric data, as well as thoughtful representation of knowledge and data availability. Despite these possibilities, significant methodological challenges and ethicolegal constraints still impede the real-world implementation of data models.</p><p><strong>Technical details: </strong>The EuCanImage is an international consortium aimed at developing AI algorithms for precision medicine in oncology and enabling secondary use of the data based on necessary ethical approvals. The use of well-defined clinical data standards to allow interoperability was a central element within the initiative. The consortium is focused on 3 different cancer types and addresses 7 unmet clinical needs. We have conceived and implemented an innovative process to capture clinical data from hospitals, transform it into the newly developed EuCanImage data models, and then store the standardized data in permanent repositories. This new workflow combines recognized software (REDCap for data capture), data standards (FHIR for data structuring), and an existing repository (EGA for permanent data storage and sharing), with newly developed custom tools for data transformation and quality control purposes (ETL pipeline, QC scripts) to complement the gaps.</p><p><strong>Conclusion: </strong>This article synthesizes our experience and procedures for health care data interoperability, standardization, and reproducibility.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"14 ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12071370/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144010593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-06 | DOI: 10.1093/gigascience/giaf041
Best-practice guidance for Earth BioGenome Project sample collection and processing: progress and challenges in biodiverse reference genome creation.
Mara K N Lawniczak, Kevin M Kocot, Jonas J Astrin, Mark Blaxter, Cibele G Sotero-Caio, Katharine B Barker, Anna K Childers, Jonathan Coddington, Paul Davis, Kerstin Howe, Warren E Johnson, Duane D McKenna, Jeremy G Wideman, Olga Vinnere Pettersson, Verena Ras, Bernardo F Santos
The Earth BioGenome Project has the extremely ambitious goal of generating, at scale, high-quality reference genomes across the entire Tree of Life. Currently in its first phase, the project is targeting family-level representatives and is progressing rapidly. Here we outline recommended standards and considerations in sample acquisition and processing for those involved in biodiverse reference genome creation. These standards and recommendations will evolve with advances in related processes. Additionally, we discuss the challenges raised by the ambitions for later phases of the project, highlighting topics related to sample collection and processing that require further development.
{"title":"Best-practice guidance for Earth BioGenome Project sample collection and processing: progress and challenges in biodiverse reference genome creation.","authors":"Mara K N Lawniczak, Kevin M Kocot, Jonas J Astrin, Mark Blaxter, Cibele G Sotero-Caio, Katharine B Barker, Anna K Childers, Jonathan Coddington, Paul Davis, Kerstin Howe, Warren E Johnson, Duane D McKenna, Jeremy G Wideman, Olga Vinnere Pettersson, Verena Ras, Bernardo F Santos","doi":"10.1093/gigascience/giaf041","DOIUrl":"10.1093/gigascience/giaf041","url":null,"abstract":"<p><p>The Earth BioGenome Project has the extremely ambitious goal of generating, at scale, high-quality reference genomes across the entire Tree of Life. Currently in its first phase, the project is targeting family-level representatives and is progressing rapidly. Here we outline recommended standards and considerations in sample acquisition and processing for those involved in biodiverse reference genome creation. These standards and recommendations will evolve with advances in related processes. Additionally, we discuss the challenges raised by the ambitions for later phases of the project, highlighting topics related to sample collection and processing that require further development.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"14 ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12121479/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144173608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-06 | DOI: 10.1093/gigascience/giaf035
Spatial integration of multi-omics data from serial sections using the novel Multi-Omics Imaging Integration Toolset.
Maximilian Wess, Maria K Andersen, Elise Midtbust, Juan Carlos Cabellos Guillem, Trond Viset, Øystein Størkersen, Sebastian Krossa, Morten Beck Rye, May-Britt Tessem
Background: Truly understanding the cancer biology of heterogeneous tumors in precision medicine requires capturing the complexities of multiple omics levels and the spatial heterogeneity of cancer tissue. Techniques like mass spectrometry imaging (MSI) and spatial transcriptomics (ST) achieve this by spatially detecting metabolites and RNA but are often applied to serial sections. To fully leverage the advantage of such multi-omics data, the individual measurements need to be integrated into 1 dataset.
Results: We present the Multi-Omics Imaging Integration Toolset (MIIT), a Python framework for integrating spatially resolved multi-omics data. A key component of MIIT's integration is the registration of serial sections, for which we developed a nonrigid registration algorithm, GreedyFHist. We validated GreedyFHist on 244 images from fresh-frozen serial sections, achieving state-of-the-art performance. As a proof of concept, we used MIIT to integrate ST and MSI data from prostate tissue samples and assessed the correlation of an ST-derived gene signature for citrate-spermine secretion with metabolic measurements from MSI.
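As a generic illustration of what nonrigid serial-section registration involves, the sketch below uses SimpleITK's off-the-shelf B-spline registration as a stand-in; file names and parameter choices are placeholders, and this is not GreedyFHist's algorithm.

```python
# Generic nonrigid (B-spline) registration of two serial-section images with
# SimpleITK. Stand-in sketch only; GreedyFHist is the method described above.
import SimpleITK as sitk

fixed = sitk.ReadImage("section_fixed.png", sitk.sitkFloat32)
moving = sitk.ReadImage("section_moving.png", sitk.sitkFloat32)

# Initialize a deformable B-spline transform over an 8x8 control-point mesh.
tx0 = sitk.BSplineTransformInitializer(fixed, [8, 8])

reg = sitk.ImageRegistrationMethod()
reg.SetMetricAsMattesMutualInformation(numberOfHistogramBins=50)
reg.SetInitialTransform(tx0, inPlace=False)
reg.SetInterpolator(sitk.sitkLinear)
reg.SetOptimizerAsLBFGSB(gradientConvergenceTolerance=1e-5,
                         numberOfIterations=100)

tx = reg.Execute(fixed, moving)            # optimize the deformation field
warped = sitk.Resample(moving, fixed, tx,  # warp moving section into fixed space
                       sitk.sitkLinear, 0.0)
sitk.WriteImage(sitk.Cast(warped, sitk.sitkUInt8), "section_registered.png")
```

Once such a transform exists, measurements from one modality (e.g., MSI pixels) can be mapped into the coordinate space of the other (e.g., ST spots), which is the integration step MIIT automates.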
Conclusion: MIIT is a highly accurate, customizable, open-source framework for integrating spatial omics technologies performed on different serial sections.
{"title":"Spatial integration of multi-omics data from serial sections using the novel Multi-Omics Imaging Integration Toolset.","authors":"Maximilian Wess, Maria K Andersen, Elise Midtbust, Juan Carlos Cabellos Guillem, Trond Viset, Øystein Størkersen, Sebastian Krossa, Morten Beck Rye, May-Britt Tessem","doi":"10.1093/gigascience/giaf035","DOIUrl":"10.1093/gigascience/giaf035","url":null,"abstract":"<p><strong>Background: </strong>Truly understanding the cancer biology of heterogeneous tumors in precision medicine requires capturing the complexities of multiple omics levels and the spatial heterogeneity of cancer tissue. Techniques like mass spectrometry imaging (MSI) and spatial transcriptomics (ST) achieve this by spatially detecting metabolites and RNA but are often applied to serial sections. To fully leverage the advantage of such multi-omics data, the individual measurements need to be integrated into 1 dataset.</p><p><strong>Results: </strong>We present the Multi-Omics Imaging Integration Toolset (MIIT), a Python framework for integrating spatially resolved multi-omics data. A key component of MIIT's integration is the registration of serial sections for which we developed a nonrigid registration algorithm, GreedyFHist. We validated GreedyFHist on 244 images from fresh-frozen serial sections, achieving state-of-the-art performance. As a proof of concept, we used MIIT to integrate ST and MSI data from prostate tissue samples and assessed the correlation of a gene signature for citrate-spermine secretion derived from ST with metabolic measurements from MSI.</p><p><strong>Conclusion: </strong>MIIT is a highly accurate, customizable, open-source framework for integrating spatial omics technologies performed on different serial sections.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"14 ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12077394/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144076950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}