Pub Date: 2025-01-06 DOI: 10.1093/gigascience/giae107
Camila L Goclowski, Julia Jakiela, Tyler Collins, Saskia Hiltemann, Morgan Howells, Marisa Loach, Jonathan Manning, Pablo Moreno, Alex Ostrovsky, Helena Rasche, Mehmet Tekman, Graeme Tyson, Pavankumar Videm, Wendi Bacon
Background: Bioinformatics is fundamental to biomedical sciences, but its mastery presents a steep learning curve for bench biologists and clinicians. Learning to code while analyzing data is difficult. The curve may be flattened by separating these two aspects and providing intermediate steps for budding bioinformaticians. Single-cell analysis is in great demand from biologists and biomedical scientists, as evidenced by the proliferation of training events, materials, and collaborative global efforts like the Human Cell Atlas. However, iterative analyses lacking reinstantiation, coupled with unstandardized pipelines, have made effective single-cell training a moving target.
Findings: To address these challenges, we present a Multi-Interface Galaxy Hands-on Training Suite (MIGHTS) for single-cell RNA sequencing (scRNA-seq) analysis, which offers parallel analytical methods using a graphical interface (buttons) or code. With clear, interoperable materials, MIGHTS facilitates smooth transitions between environments. Bridging the biologist-programmer gap, MIGHTS emphasizes interdisciplinary communication for effective learning at all levels. Real-world data analysis in MIGHTS promotes critical thinking and best practices, while FAIR data principles ensure validation of results. MIGHTS is freely available, hosted on the Galaxy Training Network, and leverages Galaxy interfaces for analyses in both settings. Given the ongoing popularity of Python-based (Scanpy) and R-based (Seurat & Monocle) scRNA-seq analyses, MIGHTS enables analyses using both.
Conclusions: MIGHTS consists of 11 tutorials, including recordings, slide decks, and interactive visualizations, and a demonstrated track record of sustainability via regular updates and community collaborations. Parallel pathways in MIGHTS enable concurrent training of scientists at any programming level, addressing the heterogeneous needs of novice bioinformaticians.
Title: Galaxy as a gateway to bioinformatics: Multi-Interface Galaxy Hands-on Training Suite (MIGHTS) for scRNA-seq. Journal: GigaScience, vol. 14. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11707610/pdf/
Pub Date: 2025-01-06 DOI: 10.1093/gigascience/giaf002
Yuansheng Zhou, Chen Tang, Xue Xiao, Xiaowei Zhan, Tao Wang, Guanghua Xiao, Lin Xu
Background: Spatially resolved profiling technologies to quantify transcriptomes, epigenomes, and proteomes have been emerging as groundbreaking methods for comprehensive molecular characterizations. Dimensionality reduction and visualization are essential steps in analyzing and interpreting spatially resolved profiling data. However, state-of-the-art dimensionality reduction methods for single-cell sequencing data, such as t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP), were not tailored for spatially resolved profiling data.
Results: Here we developed a spatially resolved t-SNE (SpaSNE) method to integrate both spatial and molecular information. We applied it to a variety of public spatially resolved profiling datasets that were generated from 3 experimental platforms and consisted of cells from different diseases, tissues, and cell types. To compare the performances of SpaSNE, t-SNE, and UMAP, we applied them to 4 spatially resolved profiling datasets obtained from 3 distinct experimental platforms (Visium, STARmap, and MERFISH) on both diseased and normal tissues. Comparisons between SpaSNE and these state-of-the-art approaches reveal that SpaSNE achieves more accurate and meaningful visualization that better elucidates the underlying spatial and molecular data structures.
Conclusions: This work demonstrates the broad application of SpaSNE for reliable and robust interpretation of cell types based on both molecular and spatial information, which can set the foundation for many subsequent analysis steps, such as differential gene expression and trajectory or pseudotime analysis on the spatially resolved profiling data.
Title: Dimensionality reduction for visualizing spatially resolved profiling data using SpaSNE. Journal: GigaScience, vol. 14. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11831803/pdf/
Pub Date: 2025-01-06 DOI: 10.1093/gigascience/giaf006
Tao Wang, Maosen Yang, Xin Shi, Shilin Tian, Yan Li, Wenqian Xie, Zhengting Zou, Dong Leng, Ming Zhang, Chengli Zheng, Chungang Feng, Bo Zeng, Xiaolan Fan, Huimin Qiu, Jing Li, Guijun Zhao, Zhengrong Yuan, Diyan Li, Hang Jie
Background: Musk, secreted by the musk gland of adult male musk-secreting mammals, holds significant pharmaceutical and cosmetic potential. However, understanding the molecular mechanisms of musk secretion remains limited, largely due to the lack of comprehensive multiomics analyses and available platforms for relevant species, such as muskrat (Ondatra zibethicus Linnaeus) and Chinese forest musk deer (Moschus berezovskii Flerov).
Results: We generated chromosome-level genome assemblies for 2 species, the muskrat (Ondatra zibethicus Linnaeus) and the musk deer (Moschus berezovskii Flerov), along with 168 transcriptomes from various muskrat tissues. Comparative analysis with 11 other vertebrate genomes revealed genes and amino acid sites with signs of adaptive convergent evolution, primarily linked to lipid metabolism, cell cycle regulation, protein binding, and immunity. Single-cell RNA sequencing in muskrat musk glands identified increased acinar/glandular epithelial cells during secretion, highlighting the role of lipometabolism in gland development and evolution. Additionally, we developed MuskDB (http://muskdb.cn/home/), a freely accessible multiomics database platform for musk-secreting mammals.
Conclusions: The study concludes that the evolution of musk secretion in muskrats and musk deer is likely driven by lipid metabolism and cell specialization. This underscores the complexity of the musk gland and calls for further investigation into musk secretion-specific genetic variants.
Title: Multiomics analysis provides insights into musk secretion in muskrat and musk deer. Journal: GigaScience, vol. 14. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11878540/pdf/
Pub Date: 2025-01-06 DOI: 10.1093/gigascience/giae082
Yichun Feng, Lu Zhou, Chao Ma, Yikai Zheng, Ruikun He, Yixue Li
Background: In recent years, large language models (LLMs) have shown promise in various domains, notably in biomedical sciences. However, their real-world application is often limited by issues like erroneous outputs and hallucinatory responses.
Results: We developed the knowledge graph-based thought (KGT) framework, an innovative solution that integrates LLMs with knowledge graphs (KGs) to improve their initial responses by utilizing verifiable information from KGs, thus significantly reducing factual errors in reasoning. The KGT framework demonstrates strong adaptability and performs well across various open-source LLMs. Notably, KGT can facilitate the discovery of new uses for existing drugs through potential drug-cancer associations and can assist in predicting resistance by analyzing relevant biomarkers and genetic mechanisms. To evaluate the knowledge graph question answering task within biomedicine, we utilize a pan-cancer knowledge graph to develop a pan-cancer question answering benchmark, named pan-cancer question answering.
Conclusions: The KGT framework substantially improves the accuracy and utility of LLMs in the biomedical field. This study serves as a proof of concept, demonstrating its exceptional performance in biomedical question answering.
Title: Knowledge graph-based thought: a knowledge graph-enhanced LLM framework for pan-cancer question answering. Journal: GigaScience, vol. 14. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11702363/pdf/
Pub Date: 2025-01-06 DOI: 10.1093/gigascience/giae119
Ty Easley, Xiaoke Luo, Kayla Hannon, Petra Lenzini, Janine Bijsterbosch
Background: The use of machine learning to classify diagnostic cases versus controls defined based on diagnostic ontologies such as the International Classification of Diseases, Tenth Revision (ICD-10) from neuroimaging features is now commonplace across a wide range of diagnostic fields. However, transdiagnostic comparisons of such classifications are lacking. Such transdiagnostic comparisons are important to establish the specificity of classification models, set benchmarks, and assess the value of diagnostic ontologies.
Results: We investigated case-control classification accuracy in 17 different ICD-10 diagnostic groups from Chapter V (mental and behavioral disorders) and Chapter VI (diseases of the nervous system) using data from the UK Biobank. Classification models were trained using either neuroimaging (structural or functional brain magnetic resonance imaging feature sets) or sociodemographic features. Random forest classification models were adopted using rigorous shuffle-splits to estimate stability as well as accuracy of case-control classifications. Diagnostic classification accuracies were benchmarked against age classification (oldest vs. youngest) from the same feature sets and against additional classifier types (k-nearest neighbors and linear support vector machine). In contrast to age classification accuracy, which was high for all feature sets, few ICD-10 diagnostic groups were classified significantly above chance (namely, demyelinating diseases based on structural neuroimaging features and depression based on sociodemographic and functional neuroimaging features).
Conclusion: These findings highlight challenges with the current disease classification system, leading us to recommend caution with the use of ICD-10 diagnostic groups as target labels in brain-based disease prediction studies.
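The shuffle-split idea in the Results can be sketched as follows: repeated random train/test splits give a distribution of accuracies, whose mean estimates accuracy and whose spread estimates stability. This is a minimal illustration under stated assumptions, not the authors' pipeline. The data are synthetic, the single feature is made up, and a nearest-centroid rule stands in for the random forest used in the study.

```python
import random
import statistics

def nearest_centroid_accuracy(train, test):
    """Accuracy of a one-feature nearest-centroid classifier (stand-in for a random forest)."""
    means = {}
    for label in (0, 1):
        values = [x for x, y in train if y == label]
        means[label] = sum(values) / len(values)
    correct = sum(
        1 for x, y in test
        if min(means, key=lambda c: abs(x - means[c])) == y
    )
    return correct / len(test)

def shuffle_split_accuracies(samples, n_splits=100, test_fraction=0.2, seed=0):
    """Repeat stratified random train/test splits and collect the test accuracy of each."""
    rng = random.Random(seed)
    by_label = {
        0: [s for s in samples if s[1] == 0],
        1: [s for s in samples if s[1] == 1],
    }
    accuracies = []
    for _ in range(n_splits):
        train, test = [], []
        for group in by_label.values():
            shuffled = group[:]
            rng.shuffle(shuffled)
            n_test = max(1, round(len(shuffled) * test_fraction))
            test.extend(shuffled[:n_test])
            train.extend(shuffled[n_test:])
        accuracies.append(nearest_centroid_accuracy(train, test))
    return accuracies

# Synthetic toy data (assumption): 50 controls (label 0) and 50 cases (label 1).
data_rng = random.Random(42)
toy_samples = [(data_rng.gauss(0.0, 1.0), 0) for _ in range(50)]
toy_samples += [(data_rng.gauss(1.0, 1.0), 1) for _ in range(50)]

toy_accuracies = shuffle_split_accuracies(toy_samples)
print("mean accuracy:", statistics.mean(toy_accuracies))
print("stability (SD across splits):", statistics.stdev(toy_accuracies))
```

Comparing the mean against chance level (0.5 for balanced classes) and inspecting the spread across splits mirrors the accuracy and stability readouts described above.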
Title: Opaque ontology: neuroimaging classification of ICD-10 diagnostic groups in the UK Biobank. Journal: GigaScience, vol. 14. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11811528/pdf/
Pub Date: 2025-01-06 DOI: 10.1093/gigascience/giaf003
Tianming Lan, Yinping Tian, Minhui Shi, Boyang Liu, Yu Lin, Yanling Xia, Yue Ma, Sunil Kumar Sahu, Qing Wang, Jun Li, Jin Chen, Fanghui Hou, Chuanling Yin, Kai Wang, Yuan Fu, Tengcheng Que, Wenjian Liu, Huan Liu, Haimeng Li, Yan Hua
A high-quality reference genome coupled with resequencing data is a promising strategy to address issues in conservation genomics. This has greatly enhanced the development of conservation plans for endangered species. Pangolins are fascinating animals with a variety of unique features. Unfortunately, they are the most trafficked wild animals in the world. In this study, we assembled chromosome-scale genomes with HiFi long reads and Hi-C short reads for the Chinese and Malayan pangolins, providing two new representative reference genomes for these species. We found a great improvement in the evaluation of genetic diversity and inbreeding based on these high-quality genomes and obtained different results for the detection of genome-wide extinction risks compared with genomes assembled using short reads. Moderate inbreeding and genetic diversity were reverified in these two pangolin species, except for one Malayan pangolin population with high inbreeding and low genetic diversity. Moreover, we identified a much higher inbreeding level (FROH = 0.54) in the Chinese pangolin individual from Taiwan Province compared with that from Mainland China, but more than 99.6% of runs of homozygosity (ROH) fragments were shorter than 1 Mb, indicating that the high FROH in Taiwan Chinese pangolins may have accumulated from historical inbreeding events. Furthermore, our study is the first to detect relatively mild genetic purging in pangolin populations. These two high-quality reference genomes will provide valuable genetic resources for future studies and contribute to the protection and conservation of pangolins.
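To make the FROH figures above concrete, the sketch below uses the commonly used definition of FROH as the total length of ROH divided by the length of the genome that was scanned. The abstract does not state how the authors computed it, so this definition, the function names, and the toy numbers are assumptions. The share of short ROH is counted here per fragment, which is one possible reading of the 99.6% figure.

```python
def froh(roh_lengths_bp, genome_length_bp):
    """FROH: fraction of the scanned genome covered by runs of homozygosity."""
    if genome_length_bp <= 0:
        raise ValueError("genome_length_bp must be positive")
    return sum(roh_lengths_bp) / genome_length_bp

def fraction_short_roh(roh_lengths_bp, threshold_bp=1_000_000):
    """Share of ROH fragments (by count) shorter than threshold_bp (default 1 Mb)."""
    if not roh_lengths_bp:
        return 0.0
    short = sum(1 for length in roh_lengths_bp if length < threshold_bp)
    return short / len(roh_lengths_bp)

# Toy values (not from the study): four ROH segments on a 10 Mb scanned genome.
toy_roh = [200_000, 450_000, 800_000, 1_500_000]
toy_genome = 10_000_000

print(froh(toy_roh, toy_genome))      # 0.295
print(fraction_short_roh(toy_roh))    # 0.75
```

Many short ROH with few long ones is the pattern usually read as older, historical inbreeding, because recombination breaks long autozygous segments down over generations. That is the reasoning behind the interpretation in the abstract.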
Title: Enhancing inbreeding estimation and global conservation insights through chromosome-level assemblies of the Chinese and Malayan pangolin. Journal: GigaScience, vol. 14. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11825179/pdf/
Pub Date: 2025-01-06 DOI: 10.1093/gigascience/giaf011
Mark Schuiveling, Hong Liu, Daniel Eek, Gerben E Breimer, Karijn P M Suijkerbuijk, Willeke A M Blokx, Mitko Veta
Background: Melanoma is an aggressive form of skin cancer in which tumor-infiltrating lymphocytes (TILs) are a biomarker for recurrence and treatment response. Manual TIL assessment is prone to interobserver variability, and current deep learning models are not publicly accessible or have low performance. Deep learning models, however, have the potential to evaluate TILs and other immune cell subsets consistently and spatially, which may improve prognostic and predictive value. To make the development of these models possible, we created the Panoptic Segmentation of nUclei and tissue in advanced MelanomA (PUMA) dataset and assessed the performance of several state-of-the-art deep learning models. In addition, we show how to improve model performance further by using heuristic postprocessing in which nuclei classes are updated based on their tissue localization.
Results: The PUMA dataset includes 155 primary and 155 metastatic melanoma hematoxylin and eosin-stained regions of interest with nuclei and tissue annotations from a single melanoma referral institution. The Hover-NeXt model, trained on the PUMA dataset, demonstrated the best performance for lymphocyte detection, approaching human interobserver agreement. In addition, heuristic postprocessing of deep learning models improved the detection of noncommon classes, such as epithelial nuclei.
Conclusion: The PUMA dataset is the first melanoma-specific dataset that can be used to develop melanoma-specific nuclei and tissue segmentation models. These models can, in turn, be used for prognostic and predictive biomarker development. Incorporating tissue and nuclei segmentation is a step toward improved deep learning nuclei segmentation performance. To support the development of these models, this dataset is used in the PUMA challenge.
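The heuristic postprocessing mentioned above, in which a nucleus class is updated according to the tissue it lies in, can be illustrated with a small lookup-based sketch. The abstract does not list the actual rules, so the rule table, class names, and data layout below are hypothetical and serve only to show the mechanism.

```python
# Hypothetical rule table (assumption, not the published rules):
# (predicted nucleus class, tissue class at the nucleus centroid) -> corrected class
HYPOTHETICAL_RULES = {
    ("epithelium", "tumor"): "tumor",
    ("tumor", "epidermis"): "epithelium",
}

def postprocess_nuclei(nuclei, tissue_mask, rules=HYPOTHETICAL_RULES):
    """Return a copy of the nuclei with classes updated from their tissue localization.

    nuclei: list of dicts with keys "class" and "centroid" (row, col).
    tissue_mask: 2D list of tissue class names, indexed as tissue_mask[row][col].
    """
    updated = []
    for nucleus in nuclei:
        row, col = nucleus["centroid"]
        tissue = tissue_mask[row][col]
        # Keep the predicted class unless a rule applies to this (class, tissue) pair.
        new_class = rules.get((nucleus["class"], tissue), nucleus["class"])
        updated.append({**nucleus, "class": new_class})
    return updated

# Toy example: a 2 x 2 tissue mask and three predicted nuclei.
toy_mask = [
    ["tumor", "tumor"],
    ["epidermis", "stroma"],
]
toy_nuclei = [
    {"class": "epithelium", "centroid": (0, 1)},  # lies in tumor tissue
    {"class": "tumor", "centroid": (1, 0)},       # lies in epidermis
    {"class": "lymphocyte", "centroid": (1, 1)},  # no rule applies
]

toy_result = postprocess_nuclei(toy_nuclei, toy_mask)
print([n["class"] for n in toy_result])  # ['tumor', 'epithelium', 'lymphocyte']
```

Keeping the rules in a plain table separates the heuristics from the model, so they can be adjusted or disabled without retraining.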
Title: A novel dataset for nuclei and tissue segmentation in melanoma with baseline nuclei segmentation and tissue segmentation benchmarks. Journal: GigaScience, vol. 14. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11837757/pdf/
Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf012
Jingmin Kang, Qingsong Li, Jie Liu, Lin Du, Peng Liu, Fuyan Liu, Yue Wang, Xunan Shen, Xujiao Luo, Ninghe Wang, Renhua Wu, Lei Song, Jizheng Wang, Xin Liu
Background: Spatial transcriptomics is a powerful tool that integrates molecular data with spatial information, thereby facilitating a deeper comprehension of tissue morphology and cellular interactions. In our study, we utilized cutting-edge spatial transcriptome sequencing technology to explore the development of the mouse heart and construct a comprehensive spatiotemporal cell atlas of early murine cardiac development.
Results: Through the analysis of this atlas, we elucidated the spatial organization of cardiac cellular lineages and their interactions during development. Notably, we observed dynamic changes in gene expression within fibroblasts and cardiomyocytes. Moreover, we identified critical genes, such as Igf2, H19, and Tcap, as well as the transcription factors Tcf12 and Plagl1, which may be associated with the loss of myocardial regeneration ability during early heart development. In addition, we identified marker genes, such as Adamts8 and Bmp10, that can distinguish between the left and right atria.
Conclusion: Our study provides novel insights into murine cardiac development and offers a valuable resource for future investigations in the field of heart research, highlighting the significance of spatial transcriptomics in understanding the complex processes of organ development.
Title: Exploring the cellular and molecular basis of murine cardiac development through spatiotemporal transcriptome sequencing. Journal: GigaScience, vol. 14. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11831923/pdf/
Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf007
Tangchao Kong, Yadong Wang, Bo Liu
Background: Long-read sequencing holds promise for high-quality, comprehensive de novo assembly of species around the world. However, assemblers still struggle to handle thousands of genomes, assembly sizes of tens of gigabases, and terabase-level datasets efficiently, which is a bottleneck for large-scale de novo sequencing studies. A major cause is the construction of the read overlapping graph, for which state-of-the-art tools usually require terabyte-level RAM and tens of days for large genomes. Such limited performance and scalability are ill-suited to the numerous samples being sequenced.
Findings: Herein, we propose xRead, a novel iterative overlapping graph construction approach that achieves high performance, scalability, and yield simultaneously. Guided by its coverage-based model, xRead converts read overlapping into heuristic read-mapping and incremental graph construction tasks, with tightly controllable RAM usage and higher speed. It enables the processing of very large datasets (such as the 1.28 Tb Ambystoma mexicanum dataset) with less than 64 GB RAM and markedly lower time costs. Moreover, benchmarks suggest that it can produce highly accurate and well-connected overlapping graphs, which also support various downstream assembly strategies.
Conclusions: xRead is able to break through the major bottleneck to graph construction and lays a new foundation for de novo assembly. This tool is suited to handle a large number of datasets from large genomes and may play important roles in many de novo sequencing studies.
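Illustration (not from the paper): the idea of turning all-versus-all read overlapping into repeated read-mapping against a small, bounded seed index, with the graph grown incrementally, can be sketched as below. This is a toy sketch under stated assumptions, not xRead's implementation. The function names, the k-mer matching rule, and the parameters `batch_size`, `min_shared`, and `target_degree` are hypothetical, and the "coverage" guidance is reduced here to skipping reads that already have enough overlaps.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_overlap_graph(reads, k=4, min_shared=2, batch_size=1, target_degree=1):
    """Toy incremental overlap-graph construction (hypothetical, for illustration).

    reads: dict mapping read id -> sequence string.
    Each round indexes only `batch_size` seed reads, so the index stays small,
    then maps every read against that index and adds the overlaps found.
    """
    graph = {read_id: set() for read_id in reads}
    pending = list(reads)  # reads that have not yet served as seeds
    while pending:
        # Assumed stand-in for coverage guidance: reads that already have
        # `target_degree` overlaps are not used as seeds.
        pending = [r for r in pending if len(graph[r]) < target_degree]
        seeds, pending = pending[:batch_size], pending[batch_size:]

        # Index only the current seed batch (bounded memory per round).
        index = defaultdict(set)
        for s in seeds:
            for km in kmers(reads[s], k):
                index[km].add(s)

        # Map every read against the seed index and count shared k-mers.
        for r, seq in reads.items():
            hits = defaultdict(int)
            for km in kmers(seq, k):
                for s in index.get(km, ()):
                    if s != r:
                        hits[s] += 1
            for s, shared in hits.items():
                if shared >= min_shared:
                    graph[r].add(s)
                    graph[s].add(r)
    return graph


if __name__ == "__main__":
    toy_reads = {"a": "ACGTACGTAA", "b": "GTACGTAACC", "c": "TTTTGGGGCC"}
    print(build_overlap_graph(toy_reads))
```

In this toy setting, memory per round depends on the seed batch rather than on the whole dataset, which is the property the abstract attributes to the approach. Real tools use far more selective seeding and alignment-based verification.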
Title: xRead: a coverage-guided approach for scalable construction of read overlapping graph. Journal: GigaScience, vol. 14. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11831799/pdf/
Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giae118
Chun Liu, Jianyu Zhang, Ranran Xu, Jinhui Lv, Zhu Qiao, Mingzhou Bai, Shancen Zhao, Lijuan Luo, Guodao Liu, Pandao Liu
Background: Drought is a major limiting factor for plant survival and crop productivity. Stylosanthes angustifolia, a pioneer plant, exhibits remarkable drought tolerance, yet the molecular mechanisms driving its drought resistance remain largely unexplored.
Results: We present a chromosome-scale reference genome of S. angustifolia, which provides insights into its genome evolution and drought tolerance mechanisms. The assembled genome is 645.88 Mb in size and contains 319.98 Mb of repetitive sequences and 36,857 protein-coding genes. The high quality of the assembly is demonstrated by a BUSCO completeness of 99.26% and a long terminal repeat assembly index of 19.49. Evolutionary analyses revealed that S. angustifolia shares a whole-genome duplication (WGD) event with other legumes but lacks a recent WGD. Additionally, S. angustifolia underwent gene expansion through tandem duplication approximately 12.31 million years ago. Through integrative multiomics analyses, we identified 4 gene families (xanthoxin dehydrogenase, 2-hydroxyisoflavanone dehydratase, patatin-related phospholipase A, and stachyose synthetase) that underwent tandem duplication and were significantly upregulated under drought stress. These gene families contribute to the biosynthesis of abscisic acid, genistein, daidzein, jasmonic acid, and stachyose, thereby enhancing drought tolerance.
Conclusions: The genome assembly of S. angustifolia represents a significant advancement in understanding the genetic mechanisms underlying drought tolerance in this pioneer plant species. This genomic resource provides critical insights into the evolution of drought resistance and offers valuable genetic information for breeding programs aimed at improving drought resistance in crops.
Title: A chromosome-scale genome assembly of the pioneer plant Stylosanthes angustifolia: insights into genome evolution and drought adaptation. Journal: GigaScience, vol. 14. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11758145/pdf/