Pub Date: 2024-01-02; DOI: 10.1093/gigascience/giae074
Jinxin Zhao, Jiru Han, Yu-Wei Lin, Yan Zhu, Michael Aichem, Dimitar Garkov, Phillip J Bergen, Sue C Nang, Jian-Zhong Ye, Tieli Zhou, Tony Velkov, Jiangning Song, Falk Schreiber, Jian Li
Background: Antimicrobial resistance is a serious threat to global health. Due to the stagnant antibiotic discovery pipeline, bacteriophages (phages) have been proposed as an alternative therapy for the treatment of infections caused by multidrug-resistant pathogens. Genomic features play an important role in phage pharmacology. However, our knowledge of phage genomics is sparse, and the use of existing bioinformatic pipelines and tools requires considerable bioinformatic expertise. These challenges have substantially limited the clinical translation of phage therapy.
Findings: We have developed PhageGE (Phage Genome Explorer), a user-friendly graphical interface application for the interactive analysis of phage genomes. PhageGE enables users to perform key analyses, including phylogenetic analysis, visualization of phylogenetic trees, prediction of phage life cycle, and comparative analysis of phage genome annotations. The new R Shiny web server, PhageGE, integrates existing R packages and combines them with several newly developed functions to facilitate these analyses. Additionally, the web server provides interactive visualization capabilities and allows users to directly export publication-quality images.
Conclusions: PhageGE is a valuable tool that simplifies the analysis of phage genome data and may expedite the development and clinical translation of phage therapy. PhageGE is publicly available at https://jason-zhao.shinyapps.io/PhageGE_Update/.
Title: PhageGE: an interactive web platform for exploratory analysis and visualization of bacteriophage genomes. GigaScience, volume 13, published 2024-01-02. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11423353/pdf/
Pub Date: 2024-01-02; DOI: 10.1093/gigascience/giae086
Jorge M Silva, Armando J Pinho, Diogo Pratas
Background: Most viral genome sequences generated during the latest pandemic have presented new challenges for computational analysis. Analyzing millions of viral genomes in multi-FASTA format is computationally demanding, especially when using alignment-based methods. Most existing methods are not designed to handle such large datasets, often requiring the analysis to be divided into smaller parts to obtain results using available computational resources.
Findings: We introduce AltaiR, a toolkit for analyzing multiple sequences in multi-FASTA format using exclusively alignment-free methodologies. AltaiR enables the identification of singularity and similarity patterns within sequences and computes static and temporal dynamics without restrictions on the number or size of input sequences. It automatically filters low-quality, biased, or deviant data. We demonstrate AltaiR's capabilities by analyzing more than 1.5 million full severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) sequences, revealing how viral genome characteristics changed over time, including shifts in nucleotide composition, decreases in average Kolmogorov sequence complexity, and the evolution of the smallest sequences not found in the human host.
Conclusions: AltaiR can identify temporal characteristics and trends in large numbers of sequences, making it ideal for scenarios involving endemic or epidemic outbreaks with vast amounts of available sequence data. Implemented in C with multithreading and methodological optimizations, AltaiR is computationally efficient, flexible, and dependency-free. It accepts any sequence in FASTA format, including amino acid sequences. The complete toolkit is freely available at https://github.com/cobilab/altair.
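To make "alignment-free" concrete, the sketch below computes two per-sequence statistics directly from multi-FASTA text: nucleotide composition (here, GC fraction) and a rough complexity proxy based on general-purpose compression. It is an illustration in Python, not AltaiR's implementation (AltaiR is written in C and its models are not described in this abstract). The function names, the toy records, and the use of zlib as a stand-in for a Kolmogorov complexity estimate are all assumptions made for this example.

```python
import zlib
from collections import Counter


def read_multi_fasta(text):
    """Yield (header, sequence) pairs from multi-FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases A, C, G, T."""
    counts = Counter(seq)
    total = sum(counts[base] for base in "ACGT")
    return (counts["G"] + counts["C"]) / total if total else 0.0


def compressed_bits_per_base(seq):
    """Crude complexity proxy: zlib-compressed size in bits per base.

    This is only an upper-bound style stand-in for Kolmogorov complexity
    (an assumption of this sketch, not the estimator used by AltaiR).
    """
    if not seq:
        return 0.0
    return 8 * len(zlib.compress(seq.encode("ascii"), 9)) / len(seq)


# Hypothetical two-record multi-FASTA input.
example = ">seq1\nACGTACGTAC\nGT\n>seq2\nGGGGCCCC\n"
records = list(read_multi_fasta(example))
for name, seq in records:
    print(name, len(seq), gc_fraction(seq), compressed_bits_per_base(seq))
```

Because each record is processed independently and no alignment is built, the cost grows linearly with the number of sequences, which is the property that makes this family of methods attractive for millions of genomes. Tracking such statistics by collection date would give a simple temporal profile.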
Title: AltaiR: a C toolkit for alignment-free and temporal analysis of multi-FASTA data. GigaScience, volume 13, published 2024-01-02. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11590114/pdf/
Pub Date: 2024-01-02; DOI: 10.1093/gigascience/giae084
Lin Hong, Xin-Dong Xu, Lei Yang, Min Wang, Shuang Li, Haijian Yang, Si-Ying Ye, Ling-Ling Chen, Jia-Ming Song
Background: Sweet orange (Citrus sinensis Osbeck) is a fruit crop of high nutritional value that is widely consumed around the world. However, its susceptibility to low-temperature stress limits its cultivation and production in regions prone to frost damage, severely impacting the sustainable development of the sweet orange industry. Developing cold-resistant sweet orange varieties is therefore necessary. Traditional hybrid breeding methods are not feasible because sweet orange is polyembryonic, so its germplasm must be improved through molecular breeding. High-quality reference genomes are valuable for studying crop resistance to biotic and abiotic stresses. However, the lack of genomic resources for cold-resistant sweet orange varieties has hindered progress in developing such varieties and in researching their molecular mechanisms of cold resistance.
Findings: This study integrated PacBio HiFi, ONT, Hi-C, and Illumina sequencing data to assemble telomere-to-telomere (T2T) reference genomes for the cold-resistant sweet orange mutant "Longhuihong" (Citrus sinensis [L.] Osb. cv. LHH) and its wild-type counterpart "Newhall" (C. sinensis [L.] Osb. cv. Newhall). Comprehensive evaluations based on multiple criteria revealed that both genomes exhibit high continuity, completeness, and accuracy. The genome sizes were 340.28 Mb and 346.33 Mb, with contig N50 of 39.31 Mb and 36.77 Mb, respectively. In total, 31,456 and 30,021 gene models were annotated in the respective genomes. Leveraging these assembled genomes, comparative genomics analyses were performed, elucidating the evolutionary history of the sweet orange genome. Moreover, the study identified 2,886 structural variants between the 2 genomes, with several SVs located in the upstream, downstream, or intronic regions of homologous genes known to be associated with cold resistance.
Conclusions: The study de novo assembled 2 T2T reference genomes of sweet orange varieties exhibiting different levels of cold tolerance. These genomes serve as valuable foundational resources for genomic research and molecular breeding aimed at enhancing cold tolerance in sweet oranges. Additionally, they expand the existing repository of reference genomes and sequencing data resources for C. sinensis. Moreover, these genomes provide a critical data foundation for comparative genomics analyses across different plant species.
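The contig N50 values quoted in the Findings summarize assembly continuity: N50 is the contig length at which contigs of that length or longer account for at least half of the total assembly. A minimal Python sketch of the calculation follows. The function name and the toy contig lengths are hypothetical and are not data from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("n50() needs at least one contig length")


# Hypothetical contig lengths in Mb (illustrative only).
toy_contigs = [40, 30, 20, 10]
print(n50(toy_contigs))
```

A contig N50 in the tens of megabases for a genome of roughly 340 Mb, as reported here, means that about half of the assembly sits in contigs close to chromosome-arm or chromosome scale.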
Title: Construction and analysis of telomere-to-telomere genomes for 2 sweet oranges: Longhuihong and Newhall (Citrus sinensis). GigaScience, volume 13, published 2024-01-02. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11590112/pdf/
Pub Date: 2024-01-02; DOI: 10.1093/gigascience/giae090
Michael P Lynch, Yufei Wang, Shannan Ho Sui, Laurent Gatto, Aedin C Culhane
Background: Multiplexing single-cell RNA sequencing experiments reduces sequencing cost and facilitates larger-scale studies. However, factors such as cell hashing quality and class size imbalance impact demultiplexing algorithm performance, reducing cost-effectiveness.
Findings: We propose a supervised algorithm, demuxSNP, which leverages both cell hashing and genetic variation between individuals (single-nucleotide polymorphisms [SNPs]). demuxSNP addresses fundamental limitations in demultiplexing methods that use only one data modality. Some cells may be confidently demultiplexed using probabilistic hashing methods. demuxSNP uses these data to infer the genotype of singlet and doublet clusters and to predict labels for cells assigned as negative, uncertain, or doublet using a nearest-neighbor approach adapted for missing data. We benchmarked demuxSNP against hashing, genotype-free SNP, and hybrid methods on simulated and real data from renal cell cancer. demuxSNP outperformed standalone hashing methods on a low-quality hashing data benchmark, improved overall classification accuracy, and allowed more cells with high RNA quality to be recovered. By varying simulated doublet rates, we showed that genotype-free SNP methods, and hybrid methods that leverage them, were impacted by class size imbalance and doublet rate. demuxSNP's supervised approach was more robust to doublet rate in experiments with class size imbalance.
Conclusions: demuxSNP uses hashing and SNP data to demultiplex datasets with low hashing quality where biological samples are genetically distinct. Unassigned or negative cells with high RNA quality are recovered, making more cells available for analysis. Data simulation and benchmarking pipelines as well as processed benchmarking data for 5-50% doublets are publicly available. demuxSNP is available as an R/Bioconductor package (https://doi.org/doi:10.18129/B9.bioc.demuxSNP).
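The idea of a nearest-neighbor classifier adapted for missing data can be sketched as follows: cells confidently labelled by hashing act as training examples, and an unassigned cell is compared with them only at SNPs observed in both cells. This Python sketch is illustrative and is not the demuxSNP implementation (which is an R/Bioconductor package). The distance definition, the value of k, the donor labels, and the toy SNP calls are all assumptions, and doublet classes are left out for brevity.

```python
from collections import Counter


def masked_distance(a, b):
    """Mean mismatch over SNPs observed in both cells; None marks missing."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return 1.0  # no overlapping SNPs: treat as maximally distant
    return sum(x != y for x, y in shared) / len(shared)


def knn_predict(query, labelled, k=3):
    """Return the majority label among the k nearest labelled cells."""
    ranked = sorted(labelled, key=lambda item: masked_distance(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical binary SNP calls (1 = alternate allele seen, 0 = not seen)
# for cells that hashing assigned with high confidence.
training = [
    ([1, 0, 1, None, 0], "donor_A"),
    ([1, 0, None, 1, 0], "donor_A"),
    ([1, None, 1, 1, 0], "donor_A"),
    ([0, 1, 0, None, 1], "donor_B"),
    ([0, 1, None, 0, 1], "donor_B"),
    ([None, 1, 0, 0, 1], "donor_B"),
]

# A cell that hashing left unassigned, with two SNPs unobserved.
unassigned_cell = [1, None, 1, None, 0]
print(knn_predict(unassigned_cell, training))
```

Restricting the comparison to jointly observed SNPs matters because single-cell SNP calls are sparse: treating a missing call as a reference allele would pull every sparse cell toward the same profile.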
Title: demuxSNP: supervised demultiplexing single-cell RNA sequencing using cell hashing and SNPs. GigaScience, volume 13, published 2024-01-02. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11604057/pdf/
Background: Deciphering spatial domains using spatially resolved transcriptomics (SRT) is of great value for characterizing and understanding tissue architecture. However, the inherent heterogeneity and varying spatial resolutions present challenges in the joint analysis of multimodal SRT data.
Results: We introduce a multimodal geometric deep learning method, named stMMR, that integrates gene expression, spatial location, and histological information to accurately identify spatial domains from SRT data. stMMR uses graph convolutional networks and a self-attention module for deep embedding of features within each modality and incorporates similarity contrastive learning to integrate features across modalities.
Conclusions: Comprehensive benchmark analysis on various types of spatial data shows superior performance of stMMR in multiple analyses, including spatial domain identification, pseudo-spatiotemporal analysis, and domain-specific gene discovery. In chicken heart development, stMMR reconstructed the spatiotemporal lineage structures, indicating an accurate developmental sequence. In breast cancer and lung cancer, stMMR clearly delineated the tumor microenvironment and identified marker genes associated with diagnosis and prognosis. Overall, stMMR effectively uses the multimodal information in various SRT data to explore and characterize tissue architecture in homeostasis, development, and tumors.
Title: stMMR: accurate and robust spatial domain identification from spatially resolved transcriptomics with multimodal feature representation. Authors: Daoliang Zhang, Na Yu, Zhiyuan Yuan, Wenrui Li, Xue Sun, Qi Zou, Xiangyu Li, Zhiping Liu, Wei Zhang, Rui Gao. DOI: 10.1093/gigascience/giae089. GigaScience, volume 13, published 2024-01-02. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11604062/pdf/
Pub Date: 2024-01-02; DOI: 10.1093/gigascience/giae104
Yongxin Ji, Jiayu Shang, Jiaojiao Guan, Wei Zou, Herui Liao, Xubo Tang, Yanni Sun
Background: Plasmids, as mobile genetic elements, play a pivotal role in facilitating the transfer of traits, such as antimicrobial resistance, within bacterial communities. Annotating plasmid-encoded proteins with the widely used Gene Ontology (GO) vocabulary is a fundamental step in various tasks, including plasmid mobility classification. However, GO prediction for plasmid-encoded proteins faces 2 major challenges: the high diversity of functions and the limited availability of high-quality GO annotations.
Results: In this study, we introduce PlasGO, a tool that leverages a hierarchical architecture to predict GO terms for plasmid proteins. PlasGO utilizes a powerful protein language model to learn the local context within protein sentences and a BERT model to capture the global context within plasmid sentences. Additionally, PlasGO allows users to control the precision by incorporating a self-attention confidence weighting mechanism. We rigorously evaluated PlasGO and benchmarked it against 7 state-of-the-art tools in a series of experiments. The experimental results collectively demonstrate that PlasGO has achieved commendable performance. PlasGO significantly expanded the annotations of the plasmid-encoded protein database by assigning high-confidence GO terms to over 95% of previously unannotated proteins, showcasing impressive precision of 0.8229, 0.7941, and 0.8870 for the 3 GO categories, respectively, as measured on the novel protein test set.
Conclusions: PlasGO, a hierarchical tool incorporating protein language models and BERT, significantly expanded plasmid protein annotations by predicting high-confidence GO terms. These annotations have been compiled into a database, which will serve as a valuable contribution to downstream plasmid analysis and research.
Title: PlasGO: enhancing GO-based function prediction for plasmid-encoded proteins based on genetic structure. GigaScience, volume 13, published 2024-01-02. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11659980/pdf/
Pub Date: 2024-01-02; DOI: 10.1093/gigascience/giae027
Rafael Moysés Alves, Vinicius A C de Abreu, Rafaely Pantoja Oliveira, João Victor Dos Anjos Almeida, Mauro de Medeiros de Oliveira, Saura R Silva, Alexandre R Paschoal, Sintia S de Almeida, Pedro A F de Souza, Jesus A Ferro, Vitor F O Miranda, Antonio Figueira, Douglas S Domingues, Alessandro M Varani
Background: Theobroma grandiflorum (Malvaceae), known as cupuassu, is a tree indigenous to the Amazon basin, valued for its large fruits and seed pulp, contributing notably to the Amazonian bioeconomy. The seed pulp is utilized in desserts and beverages, and its seed butter is used in cosmetics. Here, we present the sequenced telomere-to-telomere genome of cupuassu, disclosing its genomic structure, evolutionary features, and phylogenetic relationships within the Malvaceae family.
Findings: The cupuassu genome spans 423 Mb, encodes 31,381 genes distributed in 10 chromosomes, and exhibits approximately 65% gene synteny with the Theobroma cacao genome, reflecting a conserved evolutionary history, albeit punctuated with unique genomic variations. The main changes are marked by bursts of long-terminal repeat retrotransposons after species divergence, retrocopied and singleton genes, and gene families displaying distinctive patterns of expansion and contraction. Furthermore, positively selected genes are evident, particularly among retained and dispersed tandem and proximal duplicated genes associated with general fruit and seed traits and defense mechanisms, supporting the hypothesis of potential episodes of subfunctionalization and neofunctionalization following duplication, as well as an impact of distinct domestication processes. These genomic variations may underpin the differences observed in fruit and seed morphology, ripening, and disease resistance between cupuassu and the other Malvaceae species.
Conclusions: The cupuassu genome offers a foundational resource for both breeding improvement and conservation biology, yielding insights into the evolution and diversity within the genus Theobroma.
Title: Genomic decoding of Theobroma grandiflorum (cupuassu) at chromosomal scale: evolutionary insights for horticultural innovation. GigaScience, volume 13, published 2024-01-02. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11152179/pdf/
Pub Date : 2024-01-02 DOI: 10.1093/gigascience/giae049
Christian Gaser, Robert Dahnke, Paul M Thompson, Florian Kurth, Eileen Luders, The Alzheimer's Disease Neuroimaging Initiative
A wide range of sophisticated brain image analysis tools has been developed by the neuroscience community, greatly advancing the field of human brain mapping. Here we introduce the Computational Anatomy Toolbox (CAT), a powerful suite of tools for brain morphometric analyses that offers an intuitive graphical user interface and can also be run as a shell script. CAT is suitable for beginners, casual users, experts, and developers alike, providing a comprehensive set of analysis options, workflows, and integrated pipelines. The available analysis streams, illustrated on an example dataset, allow for voxel-based, surface-based, and region-based morphometric analyses. Notably, CAT incorporates multiple quality control options and covers the entire analysis workflow, including the preprocessing of cross-sectional and longitudinal data, statistical analysis, and the visualization of results. The overarching aim of this article is to provide a complete description and evaluation of CAT while offering a citable standard for the neuroscience community.
{"title":"CAT: a computational anatomy toolbox for the analysis of structural MRI data.","authors":"Christian Gaser, Robert Dahnke, Paul M Thompson, Florian Kurth, Eileen Luders, The Alzheimer's Disease Neuroimaging Initiative","doi":"10.1093/gigascience/giae049","journal":{"name":"GigaScience","volume":"13"},"publicationDate":"2024-01-02","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11299546/pdf/"}
Pub Date : 2024-01-02 DOI: 10.1093/gigascience/giae050
Elena Castillo-Lorenzo, Elinor Breman, Pablo Gómez Barreiro, Juan Viruel
Background: The economic importance of the globally distributed Brassicaceae family resides in the large diversity of crops within the family and the substantial variety of agronomic and functional traits they possess. We reviewed the current classifications of crop wild relatives (CWRs) in the Brassicaceae family with the aim of identifying new potential cross-compatible species from a total of 1,242 species using phylogenetic approaches.
Results: In general, cross-compatibility data between wild species and crops, as well as phenotype and genotype characterisation data, were available for major crops but very limited for minor crops, restricting the identification of new potential CWRs. Around 70% of wild Brassicaceae did not have genetic sequence data available in public repositories, and only 40% had chromosome counts published. Using phylogenetic distances, we propose 103 new potential CWRs for this family, which we recommend as priorities for cross-compatibility tests with crops and for phenotypic characterisation, including 71 newly identified CWRs for 10 minor crops. Of all the species included in this study, more than half had no records of ex situ conservation, and 80% had not been assessed for their conservation status or were data deficient (IUCN Red List Assessments).
Conclusions: Substantial ex situ conservation efforts are needed to make material accessible for characterising and evaluating the species for future breeding programmes. We identified the Mediterranean region as one key conservation area for wild Brassicaceae species, with large numbers of endemic and threatened species. Conservation assessments are urgently needed to evaluate most of these wild Brassicaceae.
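The idea of using phylogenetic distance to shortlist candidate CWRs can be sketched as follows. This is only an illustrative toy, not the authors' procedure: the species names, distance values, and the cutoff `THRESHOLD` are all hypothetical placeholders, and the abstract does not state which distance measure or threshold was used.

```python
# Hypothetical phylogenetic distances (arbitrary units) from wild species to one crop.
# None of these names or values come from the study; they are placeholders.
distances = {
    "wild_species_A": 0.04,
    "wild_species_B": 0.12,
    "wild_species_C": 0.31,
    "wild_species_D": 0.08,
}

THRESHOLD = 0.10  # assumed cutoff, chosen only for illustration


def rank_candidates(distances, threshold):
    """Return (species, distance) pairs at or below the threshold, closest first."""
    close = [(name, d) for name, d in distances.items() if d <= threshold]
    return sorted(close, key=lambda item: item[1])


candidates = rank_candidates(distances, THRESHOLD)
for name, d in candidates:
    print(f"{name}: distance {d:.2f} -> candidate for cross-compatibility testing")
```

Species that pass such a filter would still need the cross-compatibility tests that the abstract recommends; a short distance alone only prioritises them.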
{"title":"Current status of global conservation and characterisation of wild and cultivated Brassicaceae genetic resources.","authors":"Elena Castillo-Lorenzo, Elinor Breman, Pablo Gómez Barreiro, Juan Viruel","doi":"10.1093/gigascience/giae050","journal":{"name":"GigaScience","volume":"13"},"publicationDate":"2024-01-02","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11304946/pdf/"}
Pub Date : 2024-01-02 DOI: 10.1093/gigascience/giad109
Zafran Hussain Shah, Marcel Müller, Wolfgang Hübner, Tung-Cheng Wang, Daniel Telman, Thomas Huser, Wolfram Schenck
Background: Convolutional neural network (CNN)-based methods have shown excellent performance in denoising and reconstruction of super-resolved structured illumination microscopy (SR-SIM) data. Therefore, CNN-based architectures have been the focus of existing studies. However, Swin Transformer, an alternative and recently proposed deep learning-based image restoration architecture, has not been fully investigated for denoising SR-SIM images. Furthermore, it remains unclear how well transfer learning strategies work with these different types of deep learning-based methods when denoising SR-SIM images that differ in noise characteristics and recorded cell structures. Currently, the scarcity of publicly available SR-SIM datasets limits the exploration of the performance and generalization capabilities of deep learning methods.
Results: In this work, we present SwinT-fairSIM, a novel method based on the Swin Transformer for restoring SR-SIM images with a low signal-to-noise ratio. The experimental results show that SwinT-fairSIM outperforms previous CNN-based denoising methods. Furthermore, as a second contribution, two types of transfer learning (direct transfer and fine-tuning) were benchmarked in combination with SwinT-fairSIM and CNN-based methods for denoising SR-SIM data. Direct transfer did not prove to be a viable strategy, but fine-tuning produced results comparable to conventional training from scratch while saving computational time and potentially reducing the amount of training data required. As a third contribution, we publish four datasets of raw SIM images and already reconstructed SR-SIM images. These datasets cover two different types of cell structures, tubulin filaments and vesicle structures. Different noise levels are available for the tubulin filaments.
Conclusion: The SwinT-fairSIM method is well suited for denoising SR-SIM images. Through fine-tuning, already trained models can be easily adapted to different noise characteristics and cell structures. Furthermore, the provided datasets are structured so that the research community can readily use them for research on denoising, super-resolution, and transfer learning strategies.
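The difference between direct transfer, fine-tuning, and training from scratch can be illustrated with a deliberately tiny example. The sketch below is not SwinT-fairSIM and uses no image data: it assumes a one-parameter linear model `y = w * x`, synthetic "source" and "target" domains with different slopes, and plain gradient descent. All names, values, and step counts are assumptions made for illustration.

```python
def mse(w, data):
    """Mean squared error of the model y = w * x on (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def train(w, data, steps, lr=0.05):
    """Run `steps` iterations of gradient descent on the MSE, starting from w."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


xs = [0.5, 1.0, 1.5, 2.0]
source = [(x, 2.0 * x) for x in xs]  # synthetic source domain (slope 2)
target = [(x, 3.0 * x) for x in xs]  # synthetic target domain (slope 3)

# Pretrain on the source domain.
w_source = train(0.0, source, steps=200)

# Direct transfer: apply the source model to the target domain unchanged.
direct_error = mse(w_source, target)

# Fine-tuning: start from the source model and take a few steps on target data.
w_finetuned = train(w_source, target, steps=5)

# Training from scratch with the same small budget, for comparison.
w_scratch = train(0.0, target, steps=5)

print("direct transfer error:", direct_error)
print("fine-tuned error:     ", mse(w_finetuned, target))
print("from-scratch error:   ", mse(w_scratch, target))
```

In this toy setting, the fine-tuned model reaches a lower target error than both the unchanged source model and a model trained from scratch with the same small number of steps. Whether the same holds for real SR-SIM data depends on the experiments reported in the article, not on this sketch.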
{"title":"Evaluation of Swin Transformer and knowledge transfer for denoising of super-resolution structured illumination microscopy data.","authors":"Zafran Hussain Shah, Marcel Müller, Wolfgang Hübner, Tung-Cheng Wang, Daniel Telman, Thomas Huser, Wolfram Schenck","doi":"10.1093/gigascience/giad109","journal":{"name":"GigaScience","volume":"13"},"publicationDate":"2024-01-02","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10787368/pdf/"}