Pub Date: 2025-01-06 DOI: 10.1093/gigascience/giae113
Yuqi Liu, Abdulkadir Elmas, Kuan-Lin Huang
Background: Cancer mutations are often assumed to alter proteins, thus promoting tumorigenesis. However, how mutations affect protein expression, in addition to gene expression, has rarely been systematically investigated. This is significant because mRNA and protein levels frequently show only moderate correlation, driven by factors such as translation efficiency and protein degradation. Proteogenomic datasets from large tumor cohorts provide an opportunity to systematically analyze the effects of somatic mutations on mRNA and protein abundance and identify mutations with distinct impacts on these molecular levels.
Results: We conduct a comprehensive analysis of mutation impacts on mRNA- and protein-level expression in 953 cancer cases with paired genomic and global proteomic profiling across 6 cancer types. Protein-level impacts are validated for 47.2% of the somatic expression quantitative trait loci (seQTLs), including CDH1 and MSH3 truncations, as well as other mutations from likely "long-tail" driver genes. Devising a statistical pipeline for identifying somatic protein-specific QTLs (spsQTLs), we reveal several gene mutations, including NF1 and MAP2K4 truncations and TP53 missense mutations, that show a disproportionate influence on protein abundance not readily explained by transcriptomics. Cross-validation against data from massively parallel assays of variant effects (MAVE) indicates that TP53 missense mutations associated with high tumor TP53 protein levels are more likely to be experimentally confirmed as functional.
Conclusion: This study reveals that somatic mutations can exhibit distinct impacts on mRNA and protein levels, underscoring the necessity of integrating proteogenomic data to comprehensively identify functionally significant cancer mutations. These insights provide a framework for prioritizing mutations for further functional validation and therapeutic targeting.
Title: Mutation impact on mRNA versus protein expression across human cancers.
Journal: GigaScience, vol. 14. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11702362/pdf/
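The abstract does not spell out how the spsQTL pipeline works, so the following is only a minimal sketch of one plausible way to ask whether a mutation shifts protein abundance beyond what mRNA predicts. It fits a least-squares line of protein on mRNA across samples and compares the residuals of mutated and wild-type samples. The function names and the toy values are hypothetical, and this is not presented as the authors' method; a real analysis would also need a significance test and covariates such as cancer type.

```python
from statistics import mean


def protein_residuals(mrna, protein):
    """Residuals of protein abundance after a least-squares fit on mRNA."""
    mx, my = mean(mrna), mean(protein)
    sxx = sum((x - mx) ** 2 for x in mrna)
    sxy = sum((x - mx) * (y - my) for x, y in zip(mrna, protein))
    slope = sxy / sxx
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(mrna, protein)]


def residual_shift(mrna, protein, mutated):
    """Mean residual of mutated samples minus mean residual of wild-type samples."""
    residuals = protein_residuals(mrna, protein)
    mut = [r for r, is_mut in zip(residuals, mutated) if is_mut]
    wt = [r for r, is_mut in zip(residuals, mutated) if not is_mut]
    return mean(mut) - mean(wt)


# Made-up values for one gene across six samples; the last two carry a mutation
# and have less protein than their mRNA level would suggest.
toy_mrna = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
toy_protein = [1.1, 2.0, 2.9, 4.2, 3.0, 4.0]
toy_mutated = [False, False, False, False, True, True]

shift = residual_shift(toy_mrna, toy_protein, toy_mutated)
print(f"protein residual shift in mutated samples: {shift:.2f}")
```

A negative shift in this toy setting means the mutated samples carry less protein than their transcript level predicts, which is the kind of signal a protein-specific analysis would then test formally.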
Pub Date: 2025-01-06 DOI: 10.1093/gigascience/giae121
Zhihui Yuan, Maximilian Rembe, Martin Mascher, Nils Stein, Axel Himmelbach, Murukarthick Jayakodi, Andreas Börner, Klaus Oldach, Ahmed Jahoor, Jens Due Jensen, Julia Rudloff, Viktoria-Elisabeth Dohrendorf, Luisa Pauline Kuhfus, Emmanuelle Dyrszka, Matthieu Conte, Frederik Hinz, Salim Trouchaud, Jochen C Reif, Samira El Hanafi
Background: Genebanks around the globe serve as valuable repositories of genetic diversity, offering not only access to a broad spectrum of plant material but also critical resources for enhancing crop resilience, advancing scientific research, and supporting global food security. To this end, traditional genebanks are evolving into biodigital resource centers where the integration of phenotypic and genotypic data for accessions can drive more informed decision-making, optimize resource allocation, and unlock new opportunities for plant breeding and research. However, the curation and availability of interoperable phenotypic and genotypic data for genebank accessions are still in their infancy and represent an obstacle to rapid scientific discoveries in this field. Therefore, effectively promoting FAIR (i.e., findable, accessible, interoperable, and reusable) access to these data is vital for maximizing the potential of genebanks and driving progress in agricultural innovation.
Findings: Here we provide whole-genome sequencing data for 812 barley (Hordeum vulgare L.) plant genetic resources and 298 European elite materials released between 1949 and 2021, as well as phenotypic data for 4 disease resistance traits and 3 agronomic traits. We assessed the robustness of the investigated traits and the interoperability of the genomic and phenotypic data, with the aim of making this panel publicly available as a resource for future genetic research in barley.
Conclusions: The data showed broad phenotypic variability and high association mapping potential, offering a key resource for identifying genebank donors with untapped genes to advance barley breeding while safeguarding genetic diversity.
Title: High-quality phenotypic and genotypic dataset of barley genebank core collection to unlock untapped genetic diversity.
Journal: GigaScience, vol. 14. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11811526/pdf/
Pub Date: 2025-01-06 DOI: 10.1093/gigascience/giae120
Regan J Hayward, Titus Ebbecke, Hanna Fricke, Vo Quang Nguyen, Lars Barquist
Micromix is a flexible web platform for sharing and integrating microbial omics data, including RNA sequencing and transposon-insertion sequencing. Currently, the lack of solutions for making data web-accessible results in omics data being fragmented across supplementary spreadsheets or languishing as raw read data in public repositories. Micromix solves this problem and can be easily deployed on a standard web server or using cloud services. It is organism-agnostic, accommodates data and annotations from various sources, and allows filtering based on KEGG pathways, Gene Ontology terms, and curated gene sets. Visualizations are provided through a plug-in system that integrates existing visualization services and allows rapid development of new services, with available plug-ins currently supporting interactive heatmap and clustering functions. Users can upload their own data in a variety of formats to perform integrative analyses in the context of existing datasets. To support collaborative research, Micromix allows sharing of interactive sessions that maintain defined filtering and/or visualization options. We demonstrate the utility of Micromix with case studies focusing on the SPI-2 pathogenicity island in Salmonella enterica and polysaccharide utilization loci in Bacteroides thetaiotaomicron, showcasing the platform's capabilities for integrating, filtering, and visualizing diverse functional genomic datasets. Micromix is available at http://micromix.systems.
Title: Micromix: web infrastructure for visualizing and remixing microbial 'omics data.
Journal: GigaScience, vol. 14. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11788673/pdf/
Pub Date: 2025-01-06 DOI: 10.1093/gigascience/giaf040
Yin Zhang, Lin Tang, Shengyao Zhi, Bosu Hu, Zhixiang Zuo, Jian Ren, Yubin Xie, Xiaotong Luo
Background: Allelic gene-specific regulatory events are crucial mechanisms in organisms, pivotal to many fundamental biological processes such as embryonic development and chromosome inactivation. Allelic gene imbalance manifests at both RNA expression and epigenetic levels. Recent research has unveiled allele-specific regulation of RNA N6-methyladenosine (m6A), emphasizing the need for its precise identification. However, prevailing approaches primarily focus on screening allele-specific genetic variations associated with m6A and do not truly identify allelic m6A events. Therefore, a novel algorithm dedicated to identifying allele-specific m6A (ASm6A) signals is still needed to comprehensively understand the regulatory mechanism of ASm6A.
Findings: To address this limitation, we have developed a meta-analysis approach using hierarchical Bayesian models to accurately detect ASm6A events at the peak level from MeRIP-seq data. For user convenience, we introduce a unified analysis pipeline named M6Allele, streamlining the assessment of significant ASm6A across single and paired samples. Applying M6Allele to MeRIP-seq data analysis of pulmonary fibrosis and lung adenocarcinoma reveals enrichment of ASm6A events in key regulatory genes associated with these diseases, suggesting their potential involvement in disease regulation.
Conclusions: Our work provides a method for precisely identifying ASm6A events at the peak level, elucidates the interplay of m6A with human health and disease genetics, and opens a new perspective for disease research. The M6Allele software is freely available at https://github.com/RenLabBioinformatics/M6Allele under the MIT license.
Title: M6Allele: a toolkit for detection of allele-specific RNA N6-methyladenosine modifications.
Journal: GigaScience, vol. 14. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12087454/pdf/
Pub Date: 2025-01-06 DOI: 10.1093/gigascience/giaf028
David Kainer, Matthew Lane, Kyle A Sullivan, J Izaak Miller, Mikaela Cashman, Mallory Morgan, Ashley Cliff, Jonathon Romero, Angelica Walker, D Dakota Blair, Hari Chhetri, Yongqin Wang, Mirko Pavicic, Anna Furches, Jaclyn Noshay, Meghan Drake, A J Ireland, Ali Missaoui, Yun Kang, John C Sedbrook, Paramvir Dehal, Shane Canon, Daniel Jacobson
We introduce RWRtoolkit, a package for multiplex network generation, exploration, and statistical analysis, built for R and command-line users. RWRtoolkit enables the efficient exploration of large and highly complex biological networks generated from custom experimental data and/or from publicly available datasets, and is species agnostic. A range of functions can be used to find topological distances between biological entities, determine relationships within sets of interest, search for topological context around sets of interest, and statistically evaluate the strength of relationships within and between sets. The command-line interface is designed for parallelization on high-performance cluster systems, which enables high-throughput analysis such as permutation testing. Several tools in the package have also been made available for use in reproducible workflows via the KBase web application.
Title: RWRtoolkit: multi-omic network analysis using random walks on multiplex networks in any species.
Journal: GigaScience, vol. 14. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12020474/pdf/
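The RWRtoolkit entry names random walks on multiplex networks as its core technique. As a rough illustration of the underlying idea only, the sketch below runs a random walk with restart on a single-layer toy graph in plain Python. The function name, the restart probability of 0.7, and the graph are illustrative assumptions; this is not RWRtoolkit's API, and the multiplex (multi-layer) extension is left out.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities of a walker that keeps restarting at the seeds.

    adjacency maps each node to a list of its neighbours; seeds must be nodes
    of the graph. Both are hypothetical inputs for this sketch.
    """
    nodes = sorted(adjacency)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    current = dict(start)
    for _ in range(max_iter):
        # With probability `restart` the walker jumps back to a seed ...
        nxt = {n: restart * start[n] for n in nodes}
        # ... otherwise it moves to a uniformly chosen neighbour.
        for n in nodes:
            neighbours = adjacency[n]
            if not neighbours:
                # Isolated node: return its probability mass to the seeds.
                for s in seeds:
                    nxt[s] += (1 - restart) * current[n] / len(seeds)
                continue
            share = (1 - restart) * current[n] / len(neighbours)
            for m in neighbours:
                nxt[m] += share
        change = sum(abs(nxt[n] - current[n]) for n in nodes)
        current = nxt
        if change < tol:
            break
    return current


# Toy undirected path graph A - B - C - D with a single seed at A.
toy_graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = random_walk_with_restart(toy_graph, seeds=["A"])
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 4))
```

Nodes closer to the seed receive higher scores, which is why rankings of this kind are commonly used to look for topological context around a set of interest.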
Background: The common carp (Cyprinus carpio) is a key species in global freshwater aquaculture. One of its variants, the koi carp, is particularly prized for its aesthetic appeal. However, the lack of a high-quality genome has limited genetic research and breeding efforts for common carp and koi carp.
Findings: This study presents a gap-free genome for the Taisho Sansyoku koi carp strain (C. carpio). The assembly achieved a total size of 1,555.86 Mb with a contig N50 of 30.45 Mb, comprising 50 gap-free pseudochromosomes ranging in length from 20.70 to 49.02 Mb. The BUSCO completeness score reached 99.20%, and the Genome Continuity Inspector score was 85.82, indicating high genome integrity and accuracy. Notably, 83 out of 100 telomeres were detected, resulting in 33 chromosomes possessing complete telomeres. Comparative genomic analysis showed that the expanded gene families and unique genes play essential roles in various biological traits, such as energy metabolism, endocrine regulation, cell proliferation, and immune response, potentially related to multiple metabolic diseases and health conditions. The positively selected genes are linked to various biological processes, such as metalloendopeptidase activity, which plays a significant role in the central nervous system and is associated with diseases.
Conclusions: The koi carp genome assembly (CC 4.0) fills a critical gap in understanding common carp's biology and adaptation. It provides an invaluable resource for molecular-guided breeding and genetic enhancement strategies, underscoring the importance of common carp and koi carp in aquaculture and ecological research.
Title: A telomere-to-telomere genome assembly of koi carp (Cyprinus carpio) using long reads and Hi-C technology.
Authors: Jiandong Yuan, Jiang Li, Jun Yong, Xuewu Liao, Huijuan Guo, Yongchao Niu
Pub Date: 2025-01-06 DOI: 10.1093/gigascience/giaf087
Journal: GigaScience, vol. 14. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12395963/pdf/
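The koi carp entry reports a contig N50 and telomere counts. For readers unfamiliar with these figures, the sketch below shows how an N50 is conventionally computed (the length at which contigs of that size or longer cover at least half of the assembly) and checks the telomere arithmetic from the abstract. The contig lengths are made up for illustration, and the telomere check assumes every detected telomere sits at the end of one of the 50 pseudochromosomes.

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contig lengths given")


# Made-up contig lengths (in Mb), not taken from the assembly.
toy_contigs = [50, 40, 30, 20, 10]
print("toy N50:", n50(toy_contigs))

# Telomere bookkeeping using the numbers quoted in the abstract.
chromosomes = 50
telomeres_detected = 83
complete_chromosomes = 33  # telomeres detected at both ends
possible_telomeres = 2 * chromosomes
one_ended = telomeres_detected - 2 * complete_chromosomes
print("possible telomeres:", possible_telomeres)
print("chromosomes with a single detected telomere:", one_ended)
```

Under that assumption the reported numbers are internally consistent: 33 complete chromosomes account for 66 telomeres, and the remaining 17 detected telomeres fall on the remaining 17 chromosomes.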
Pub Date: 2025-01-06 DOI: 10.1093/gigascience/giaf090
Xiao Li, Ximo Pechuan-Jorge, Tyler Risom, Conrad Foo, Alexander Prilipko, Artem Zubkov, Caleb Chan, Patrick Chang, Frank Peale, James Ziai, Sandra Rost, Derrek Hibar, Lisa McGinnis, Evgeniy Tabatsky, Xin Ye, Hector Corrada Bravo, Zhen Shi, Malgorzata Nowicka, Jon Scherdin, James Cowan, Jennifer Giltnane, Darya Orlova, Rajiv Jesudason
Recent advancements in transcriptomics and proteomics have opened the possibility for spatially resolved molecular characterization of tissue architecture with the promise of enabling a deeper understanding of tissue biology in either homeostasis or disease. The wealth of data generated by these technologies has recently driven the development of a wide range of computational methods. These methods require advanced coding fluency to be applied and integrated across the full spatial omics analysis process, which presents a hurdle for widespread adoption by the biology research community. To address this, we introduce SPEX (Spatial Expression Explorer), a web-based analysis platform that employs modular analysis pipeline design, accessible through a user-friendly interface. SPEX's infrastructure allows for streamlined access to open-source image data management systems, analysis modules, and fully integrated data visualization solutions. Analysis modules include essential steps covering image processing, single-cell analysis, and spatial analysis. We demonstrate SPEX's ability to facilitate the discovery of biological insights in spatially resolved omics datasets from healthy tissue to tumor samples.
Title: SPEX: A modular end-to-end platform for high-plex tissue spatial omics analysis.
Journal: GigaScience, vol. 14. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12395962/pdf/
BioSample is a repository of experimental sample metadata. It is a comprehensive archive that enables searches of experiments, regardless of type. However, there is substantial variability in the submitted metadata due to the difficulty in defining comprehensive rules for describing them and the limited user awareness of best practices in creating them. This inconsistency poses considerable challenges to the findability and reusability of archived data. Given the scale of BioSample, which hosts over 40 million records, manual curation is impractical. Automatic rule-based ontology mapping methods have been proposed to address this issue, but their effectiveness is limited by the heterogeneity of the metadata. Recently, large language models (LLMs) have gained attention in natural language processing and are promising tools for automating metadata curation. In this study, we evaluated the performance of LLMs in extracting cell line names from BioSample descriptions using a gold-standard dataset derived from ChIP-Atlas, a secondary database of epigenomics experiment data in which samples were manually curated. The LLM-assisted methods outperformed traditional approaches, achieving higher accuracy and coverage. We further extended them to extract information about experimentally manipulated genes from metadata when manual curation had not yet been applied in ChIP-Atlas. This also yielded successful results, including the facilitation of more precise filtering of the data and the prevention of possible misinterpretations caused by the inclusion of unintended data. These findings underscore the potential of LLMs in improving the findability and reusability of experimental data in general, which would considerably reduce the user workload and enable more effective scientific data management.
{"title":"Extraction of biological terms using large language models enhances the usability of metadata in the BioSample database.","authors":"Shuya Ikeda, Zhaonan Zou, Hidemasa Bono, Yuki Moriya, Shuichi Kawashima, Toshiaki Katayama, Shinya Oki, Tazro Ohta","doi":"10.1093/gigascience/giaf070","DOIUrl":"10.1093/gigascience/giaf070","url":null,"abstract":"<p><p>BioSample is a repository of experimental sample metadata. It is a comprehensive archive that enables searches of experiments, regardless of type. However, there is substantial variability in the submitted metadata due to the difficulty in defining comprehensive rules for describing them and the limited user awareness of best practices in creating them. This inconsistency poses considerable challenges to the findability and reusability of archived data. Given the scale of BioSample, which hosts over 40 million records, manual curation is impractical. Automatic rule-based ontology mapping methods have been proposed to address this issue, but their effectiveness is limited by the heterogeneity of the metadata. Recently, large language models (LLMs) have gained attention in natural language processing and are promising tools for automating metadata curation. In this study, we evaluated the performance of LLMs in extracting cell line names from BioSample descriptions using a gold-standard dataset derived from ChIP-Atlas, a secondary database of epigenomics experiment data in which samples were manually curated. The LLM-assisted methods outperformed traditional approaches, achieving higher accuracy and coverage. We further extended them to extract information about experimentally manipulated genes from metadata when manual curation had not yet been applied in ChIP-Atlas. This also yielded successful results, including the facilitation of more precise filtering of the data and the prevention of possible misinterpretations caused by the inclusion of unintended data. 
These findings underscore the potential of LLMs in improving the findability and reusability of experimental data in general, which would considerably reduce the user workload and enable more effective scientific data management.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"14 ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12205978/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144474817","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf046
Mingyue Chen, Xingyu Yang, Lan Xun, Zhenlin Qu, Shihai Yang, Yunqiang Yang, Yongping Yang
Background: Dioecy, a common reproductive strategy in angiosperms, has evolved independently in various plant lineages, and this has resulted in the evolution of diverse sex chromosome systems and sex determination mechanisms. Hippophae is a genus of dioecious plants with an XY sex determination system, but the molecular underpinnings of this process have not yet been clarified. Most previously published sea buckthorn genome data have been derived from females, yet genomic data on males are critically important for clarifying our understanding of sex determination in this genus. Comparative genomic analyses of male and female sea buckthorn plants can shed light on the origins and evolution of sex. These studies can also enhance our understanding of the molecular mechanisms underlying sexual differentiation and provide novel insights and data for future research on sexual reproduction in plants.
Results: We conducted an in-depth analysis of the genomes of 2 sea buckthorn species, including a male Hippophae gyantsensis, a female Hippophae salicifolia, and 2 haplotypes of male H. salicifolia. The genome size of H. gyantsensis was 704.35 Mb, and that of the female H. salicifolia was 788.28 Mb. The sizes of the 2 haplotype genomes were 1,139.99 Mb and 1,097.34 Mb. The sex-determining region (SDR) of H. salicifolia was 29.71 Mb and contained 249 genes. A comparative analysis of the haplotypes of Chr02 of H. salicifolia revealed that the Y chromosome was shorter than the X chromosome. Chromosomal evolution analysis indicated that Hippophae has experienced significant chromosomal rearrangements following 2 whole-genome duplication events, and the fusion of 2 chromosomes has potentially led to the early formation of sex chromosomes in sea buckthorn. Multiple structural variations between Y and X sex-linked regions might have facilitated the rapid evolution of sex chromosomes in H. salicifolia. Comparison of the transcriptome data of male and female flower buds from H. gyantsensis and H. salicifolia revealed 11 genes specifically expressed in males. Three of these were identified as candidate genes involved in the sex determination of sea buckthorn. These findings will aid future studies of the sex determination mechanisms in sea buckthorn.
Conclusion: A comparative genomic analysis was performed to identify the SDR in H. salicifolia. The origins and evolutionary trajectories of sex chromosomes within Hippophae were also determined. Three potential candidate genes associated with sea buckthorn sex determination were identified. Overall, our findings will aid future studies aimed at clarifying the mechanisms of sex determination.
{"title":"The genome of Hippophae salicifolia provides new insights into the sexual differentiation of sea buckthorn.","authors":"Mingyue Chen, Xingyu Yang, Lan Xun, Zhenlin Qu, Shihai Yang, Yunqiang Yang, Yongping Yang","doi":"10.1093/gigascience/giaf046","DOIUrl":"10.1093/gigascience/giaf046","url":null,"abstract":"<p><strong>Background: </strong>Dioecy, a common reproductive strategy in angiosperms, has evolved independently in various plant lineages, and this has resulted in the evolution of diverse sex chromosome systems and sex determination mechanisms. Hippophae is a genus of dioecious plants with an XY sex determination system, but the molecular underpinnings of this process have not yet been clarified. Most previously published sea buckthorn genome data have been derived from females, yet genomic data on males are critically important for clarifying our understanding of sex determination in this genus. Comparative genomic analyses of male and female sea buckthorn plants can shed light on the origins and evolution of sex. These studies can also enhance our understanding of the molecular mechanisms underlying sexual differentiation and provide novel insights and data for future research on sexual reproduction in plants.</p><p><strong>Results: </strong>We conducted an in-depth analysis of the genomes of 2 sea buckthorn species, including a male Hippophae gyantsensis, a female Hippophae salicifolia, and 2 haplotypes of male H. salicifolia. The genome size of H. gyantsensis was 704.35 Mb, and that of the female H. salicifolia was 788.28 Mb. The sizes of the 2 haplotype genomes were 1,139.99 Mb and 1,097.34 Mb. The sex-determining region (SDR) of H. salicifolia was 29.71 Mb and contained 249 genes. A comparative analysis of the haplotypes of Chr02 of H. salicifolia revealed that the Y chromosome was shorter than the X chromosome. 
Chromosomal evolution analysis indicated that Hippophae has experienced significant chromosomal rearrangements following 2 whole-genome duplication events, and the fusion of 2 chromosomes has potentially led to the early formation of sex chromosomes in sea buckthorn. Multiple structural variations between Y and X sex-linked regions might have facilitated the rapid evolution of sex chromosomes in H. salicifolia. Comparison of the transcriptome data of male and female flower buds from H. gyantsensis and H. salicifolia revealed 11 genes specifically expressed in males. Three of these were identified as candidate genes involved in the sex determination of sea buckthorn. These findings will aid future studies of the sex determination mechanisms in sea buckthorn.</p><p><strong>Conclusion: </strong>A comparative genomic analysis was performed to identify the SDR in H. salicifolia. The origins and evolutionary trajectories of sex chromosomes within Hippophae were also determined. Three potential candidate genes associated with sea buckthorn sex determination were identified. Overall, our findings will aid future studies aimed at clarifying the mechanisms of sex determination.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"14 ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12218201/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144553223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf047
Zheng-Feng Wang, En-Ping Yu, Lin Fu, Hua-Ge Deng, Wei-Guang Zhu, Feng-Xia Xu, Hong-Lin Cao
Background: The genus Ormosia belongs to the Fabaceae family; almost all Ormosia species are endemic to China, which is considered one of the centers of this genus. Thus, genomic studies on the genus are needed to better understand species evolution and ensure the conservation and utilization of these species. We performed a chromosome-scale assembly of O. purpureiflora and updated the chromosome-scale assemblies of O. emarginata and O. semicastrata for comparative genomics.
Findings: The genome assembly sizes of the 3 species ranged from 1.42 to 1.58 Gb, with O. purpureiflora being the largest. Repetitive sequences accounted for 74.0-76.3% of the genomes, and the predicted gene counts ranged from 50,517 to 55,061. Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis indicated 97.0-98.4% genome completeness, whereas the long terminal repeat (LTR) assembly index values ranged from 13.66 to 17.56, meeting the "reference genome" quality standard. Gene completeness, assessed using BUSCO and OMArk, ranged from 95.1% to 96.3% and from 97.1% to 98.1%, respectively. Characterizing genome architectures further revealed that inversions were the main structural rearrangements in Ormosia. In numbers, density distributions of repetitive elements revealed the types of Helitron and terminal inverted repeat (TIR) elements and the types of Gypsy and unknown LTR retrotransposons (LTR-RTs) concentrated in different regions on the chromosomes, whereas Copia LTR-RTs were generally evenly distributed along the chromosomes in Ormosia. Compared with the sister species Lupinus albus, Ormosia species had lower numbers and percentages of resistance (R) genes and transcription factor genes. Genes related to alkaloid, terpene, and flavonoid biosynthesis were found to be duplicated through tandem or proximal duplications. Notably, some genes associated with growth and defense were absent in O. purpureiflora. By resequencing 153 genotypes (∼30 Gb of data per sample) from 6 O. purpureiflora (sub)populations, we identified 40,146 single nucleotide polymorphisms. Corresponding to its very small populations, O. purpureiflora exhibited low genetic diversity.
Conclusions: The Ormosia genome assemblies provide valuable resources for studying the evolution, conservation, and potential utility of both Ormosia and Fabaceae species.
{"title":"Chromosome-scale assemblies of three Ormosia species: repetitive sequences distribution and structural rearrangement.","authors":"Zheng-Feng Wang, En-Ping Yu, Lin Fu, Hua-Ge Deng, Wei-Guang Zhu, Feng-Xia Xu, Hong-Lin Cao","doi":"10.1093/gigascience/giaf047","DOIUrl":"10.1093/gigascience/giaf047","url":null,"abstract":"<p><strong>Background: </strong>The genus Ormosia belongs to the Fabaceae family; almost all Ormosia species are endemic to China, which is considered one of the centers of this genus. Thus, genomic studies on the genus are needed to better understand species evolution and ensure the conservation and utilization of these species. We performed a chromosome-scale assembly of O. purpureiflora and updated the chromosome-scale assemblies of O. emarginata and O. semicastrata for comparative genomics.</p><p><strong>Findings: </strong>The genome assembly sizes of the 3 species ranged from 1.42 to 1.58 Gb, with O. purpureiflora being the largest. Repetitive sequences accounted for 74.0-76.3% of the genomes, and the predicted gene counts ranged from 50,517 to 55,061. Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis indicated 97.0-98.4% genome completeness, whereas the long terminal repeat (LTR) assembly index values ranged from 13.66 to 17.56, meeting the \"reference genome\" quality standard. Gene completeness, assessed using BUSCO and OMArk, ranged from 95.1% to 96.3% and from 97.1% to 98.1%, respectively. Characterizing genome architectures further revealed that inversions were the main structural rearrangements in Ormosia. 
In numbers, density distributions of repetitive elements revealed the types of Helitron and terminal inverted repeat (TIR) elements and the types of Gypsy and unknown LTR retrotransposons (LTR-RTs) concentrated in different regions on the chromosomes, whereas Copia LTR-RTs were generally evenly distributed along the chromosomes in Ormosia. Compared with the sister species Lupinus albus, Ormosia species had lower numbers and percentages of resistance (R) genes and transcription factor genes. Genes related to alkaloid, terpene, and flavonoid biosynthesis were found to be duplicated through tandem or proximal duplications. Notably, some genes associated with growth and defense were absent in O. purpureiflora. By resequencing 153 genotypes (∼30 Gb of data per sample) from 6 O. purpureiflora (sub)populations, we identified 40,146 single nucleotide polymorphisms. Corresponding to its very small populations, O. purpureiflora exhibited low genetic diversity.</p><p><strong>Conclusions: </strong>The Ormosia genome assemblies provide valuable resources for studying the evolution, conservation, and potential utility of both Ormosia and Fabaceae species.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"14 ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12083454/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144077473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}