Pub Date: 2024-01-02; DOI: 10.1093/gigascience/giae070
Yuanting Shen, Lidan Tao, Rengang Zhang, Gang Yao, Minjie Zhou, Weibang Sun, Yongpeng Ma
Background: Advanced whole-genome sequencing techniques can capture nearly all nucleotide variation in a genome and thus provide deep insights into protecting endangered species. However, genomic data are still rarely used to design conservation strategies, particularly for endangered plants. Here we performed a comprehensive conservation genomic analysis of Malania oleifera, an endangered tree species rich in nervonic acid. We used whole-genome resequencing data from 165 samples, covering 16 populations across the entire distribution range, to investigate the causes of its extremely small population sizes and to evaluate possible genomic offsets and changes in ecological niche suitability under future climate change.
Results: Although M. oleifera maintains relatively high genetic diversity among endangered woody plants (θπ = 3.87 × 10⁻³), we observed high levels of inbreeding, which have reduced genetic diversity in 3 populations (JM, NP, and BM2) and caused the accumulation of deleterious mutations. Repeated bottleneck events, recent inbreeding (∼490 years ago), and anthropogenic disturbance of wild habitats have aggravated the fragmentation of M. oleifera and made it endangered. Owing to the significant effect of higher average annual temperature, populations at low altitude exhibit a greater genomic offset. Furthermore, ecological niche modeling shows that the suitable habitat for M. oleifera will decrease by 71.15% and 98.79% by 2100 under scenarios SSP126 and SSP585, respectively.
Conclusions: These findings on the threats to M. oleifera provide a scientific foundation for defining management and adaptive units, as well as prioritizing populations for genetic rescue. We also highlight the importance of integrating genomic offset and ecological niche modeling to design targeted conservation actions under future climate change. Overall, our study provides a paradigm for genomics-directed conservation.
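The θπ statistic quoted in the Results is nucleotide diversity: the average number of pairwise differences per site among sampled sequences. The sketch below illustrates that definition on a toy alignment. The function name `nucleotide_diversity` and the four sequences are hypothetical and are not taken from the study, which would have estimated θπ from resequencing variant calls rather than from a small alignment like this one.

```python
from itertools import combinations


def nucleotide_diversity(sequences):
    """Average pairwise differences per site for equal-length aligned sequences."""
    if len(sequences) < 2:
        raise ValueError("need at least two sequences")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to equal length")

    # Count mismatching sites for every unordered pair of sequences.
    pairs = list(combinations(sequences, 2))
    total_differences = sum(
        sum(1 for a, b in zip(first, second) if a != b)
        for first, second in pairs
    )
    # Normalise by the number of pairs and by the alignment length.
    return total_differences / (len(pairs) * length)


# Toy alignment (hypothetical data, for illustration only).
toy_alignment = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAC", "ACGTACGTAC"]
pi = nucleotide_diversity(toy_alignment)
print(f"theta_pi = {pi:.3f}")
```

In this toy case the six sequence pairs differ at six sites in total over ten positions, so the estimate is 6 / (6 × 10) = 0.1, far higher than the genome-wide value reported for M. oleifera.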
Title: Genomic insights into endangerment and conservation of the garlic-fruit tree (Malania oleifera), a plant species with extremely small populations. Journal: GigaScience, volume 13 (published 2024-01-02). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11417964/pdf/
Pub Date: 2024-01-02; DOI: 10.1093/gigascience/giae064
Yan Lu, Fang Luo, An Zhou, Cun Yi, Hao Chen, Jian Li, Yunhai Guo, Yuxiang Xie, Wei Zhang, Datao Lin, Yaming Yang, Zhongdao Wu, Yi Zhang, Shuhua Xu, Wei Hu
Pomacea canaliculata, an invasive species native to South America, is recognized for its broad geographic distribution and adaptability to a variety of ecological conditions. The details of the evolution and adaptation of P. canaliculata remain unclear due to a lack of whole-genome resequencing data. We examined 173 P. canaliculata genomes representing 17 geographic populations in East and Southeast Asia. Interestingly, P. canaliculata showed a higher level of genetic diversity than other mollusks, and our analysis suggested that the dispersal of P. canaliculata could have been driven by climate change and human activities. Notably, we identified a set of genes associated with low-temperature adaptation, including Csde1, a gene encoding a cold shock protein. Further RNA sequencing analysis and reverse transcription quantitative polymerase chain reaction experiments demonstrated the gene's dynamic expression pattern and biological functions during cold exposure. Moreover, both positive selection and balancing selection are likely to have contributed to the rapid environmental adaptation of P. canaliculata populations. In particular, genes associated with energy metabolism and stress response were under positive selection, while a large number of immune-related genes showed strong signatures of balancing selection. Our study advances the understanding of the evolution of P. canaliculata and provides a valuable resource on an invasive species.
Title: Whole-genome sequencing of the invasive golden apple snail Pomacea canaliculata from Asia reveals rapid expansion and adaptive evolution. Journal: GigaScience, volume 13 (published 2024-01-02). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11417965/pdf/
Pub Date: 2024-01-02; DOI: 10.1093/gigascience/giae068
Danilo Bzdok, Guy Wolf, Jakub Kopal
Big neuroscience datasets are not big small datasets when it comes to quantitative data analysis. Neuroscience has now witnessed the advent of many population cohort studies that deep-profile participants, yielding hundreds of measures that capture dimensions of each individual's position in the broader society. Indeed, there is a rebalancing from small, strictly selected, and thus homogenized cohorts toward ever larger, more representative, and thus more diverse cohorts. This shift in cohort composition is prompting the revision of incumbent modeling practices. Major sources of population stratification increasingly overshadow the subtle effects that neuroscientists typically study. In our opinion, as we sample individuals from ever more diverse backgrounds, we will require a new stack of quantitative tools to realize diversity-aware modeling. Here we take inventory of candidate analytical frameworks. Better incorporating the driving factors behind population structure will refine our understanding of how brain-behavior relationships depend on human subgroups.
Title: Harnessing population diversity: in search of tools of the trade. Journal: GigaScience, volume 13 (published 2024-01-02). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11427908/pdf/
Pub Date: 2024-01-02; DOI: 10.1093/gigascience/giae080
Shizhuo Zhang, Jiyun Han, Juntao Liu
Identification of protein-protein and protein-nucleic acid binding sites provides insights into biological processes related to protein functions and technical guidance for disease diagnosis and drug design. However, accurate prediction by computational approaches remains highly challenging due to limited knowledge of residue binding patterns. The binding pattern of a residue should be characterized by the spatial distribution of its neighboring residues combined with the interaction of their physicochemical information, which previous methods have not achieved. Here, we design GraphRBF, a hierarchical geometric deep learning model that learns residue binding patterns from big data. To this end, GraphRBF describes physicochemical information interactions with an enhanced graph neural network and characterizes residue spatial distributions with a prioritized radial basis function neural network. After training and testing, GraphRBF shows great improvements over existing state-of-the-art methods and strong interpretability of its learned representations. Applied to the SARS-CoV-2 omicron spike protein, GraphRBF successfully identifies known epitopes of the protein. Moreover, it predicts, with strong evidence, multiple potential binding regions for new nanobodies or even new drugs. A user-friendly online server for GraphRBF is freely available at http://liulab.top/GraphRBF/server.
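To make the idea of characterizing spatial distributions with radial basis functions more concrete, the sketch below expands a residue-to-residue distance into Gaussian RBF features. This is a generic illustration and not GraphRBF's implementation: the function names, the choice of centers, the `gamma` width, and the coordinates are all assumptions made for this example, and the abstract does not describe how the prioritized RBF network is built.

```python
import math


def euclidean_distance(point_a, point_b):
    """Straight-line distance between two 3D coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(point_a, point_b)))


def rbf_encode(distance, centers, gamma=1.0):
    """Expand a scalar distance into Gaussian radial basis features.

    Each feature peaks at 1.0 when the distance equals its center and
    decays smoothly as the distance moves away from that center.
    """
    return [math.exp(-gamma * (distance - center) ** 2) for center in centers]


# Hypothetical coordinates of a target residue and one neighbor.
target = (0.0, 0.0, 0.0)
neighbor = (0.0, 0.0, 2.0)

# Hypothetical centers; a real model would choose or learn them.
centers = [0.0, 2.0, 4.0]

features = rbf_encode(euclidean_distance(target, neighbor), centers, gamma=0.5)
print(features)
```

The feature whose center matches the distance (here the second one) takes the value 1.0, while the others decay symmetrically, so the vector acts as a soft, differentiable binning of the neighbor's distance.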
Title: Protein-protein and protein-nucleic acid binding site prediction via interpretable hierarchical geometric deep learning. Journal: GigaScience, volume 13 (published 2024-01-02). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11528319/pdf/
Pub Date: 2024-01-02; DOI: 10.1093/gigascience/giad113
Sheeba Samuel, Daniel Mietchen
Background: Jupyter notebooks facilitate the bundling of executable code with its documentation and output in one interactive environment, and they represent a popular mechanism to document and share computational workflows, including for research publications. The reproducibility of computational aspects of research is a key component of scientific reproducibility but has not yet been assessed at scale for Jupyter notebooks associated with biomedical publications.
Approach: We address computational reproducibility at 2 levels: (i) using fully automated workflows, we analyzed the computational reproducibility of Jupyter notebooks associated with publications indexed in the biomedical literature repository PubMed Central. We identified such notebooks by mining the article's full text, trying to locate them on GitHub, and attempting to rerun them in an environment as close to the original as possible. We documented reproduction success and exceptions and explored relationships between notebook reproducibility and variables related to the notebooks or publications. (ii) This study represents a reproducibility attempt in and of itself, using essentially the same methodology twice on PubMed Central over the course of 2 years, during which the corpus of Jupyter notebooks from articles indexed in PubMed Central has grown in a highly dynamic fashion.
Results: Out of 27,271 Jupyter notebooks from 2,660 GitHub repositories associated with 3,467 publications, 22,578 notebooks were written in Python, including 15,817 that had their dependencies declared in standard requirement files and that we attempted to rerun automatically. For 10,388 of these, all declared dependencies could be installed successfully, and we reran them to assess reproducibility. Of these, 1,203 notebooks ran through without any errors, including 879 that produced results identical to those reported in the original notebook and 324 for which our results differed from the originally reported ones. Running the other notebooks resulted in exceptions.
Conclusions: We zoom in on common problems and practices, highlight trends, and discuss potential improvements to Jupyter-related workflows associated with biomedical publications.
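The counts in the Results form a funnel, and the proportions are easier to judge when computed explicitly. The short script below uses only the numbers quoted in the abstract; the stage labels and the helper `percent` are naming choices made for this illustration.

```python
# Stage counts as reported in the abstract's Results paragraph.
funnel = [
    ("Jupyter notebooks found", 27271),
    ("written in Python", 22578),
    ("dependencies declared in requirement files", 15817),
    ("all dependencies installed", 10388),
    ("ran without errors", 1203),
    ("results identical to the original", 879),
]
differing_results = 324  # ran without errors, but results differed


def percent(part, whole):
    """Share of `part` in `whole`, as a percentage."""
    return 100.0 * part / whole


# Report each stage relative to the stage immediately before it.
for (_, previous), (label, count) in zip(funnel, funnel[1:]):
    print(f"{label}: {count} ({percent(count, previous):.1f}% of previous stage)")

rerun = funnel[3][1]
ran_ok = funnel[4][1]
identical = funnel[5][1]
print(f"identical among rerun notebooks: {percent(identical, rerun):.1f}%")
```

The two outcome groups add up as stated (879 + 324 = 1,203), and of the 10,388 notebooks that could be rerun, about 11.6% finished without errors and about 8.5% reproduced the original results exactly.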
Title: Computational reproducibility of Jupyter notebooks from biomedical publications. Journal: GigaScience, volume 13 (published 2024-01-02). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10783158/pdf/
Pub Date: 2024-01-02; DOI: 10.1093/gigascience/giad111
Akshay Akshay, Mitali Katoch, Navid Shekarchizadeh, Masoud Abedi, Ankush Sharma, Fiona C Burkhard, Rosalyn M Adam, Katia Monastyrskaya, Ali Hashemi Gheinani
Background: Machine learning (ML) has emerged as a vital asset for researchers to analyze and extract valuable information from complex datasets. However, developing an effective and robust ML pipeline can present a real challenge, demanding considerable time and effort, thereby impeding research progress. Existing tools in this landscape require a profound understanding of ML principles and programming skills. Furthermore, users are required to engage in the comprehensive configuration of their ML pipeline to obtain optimal performance.
Results: To address these challenges, we have developed a novel tool called Machine Learning Made Easy (MLme) that streamlines the use of ML in research and currently focuses on classification problems. By integrating 4 essential functionalities (Data Exploration, AutoML, CustomML, and Visualization), MLme fulfills the diverse requirements of researchers while eliminating the need for extensive coding. To demonstrate the applicability of MLme, we conducted rigorous testing on 6 distinct datasets, each presenting unique characteristics and challenges. Our results consistently showed promising performance across the datasets, supporting the versatility and effectiveness of the tool. Additionally, by using MLme's feature selection functionality, we successfully identified significant markers for CD8+ naive (BACH2), CD16+ (CD16), and CD14+ (VCAN) cell populations.
Conclusion: MLme serves as a valuable resource for leveraging ML to facilitate insightful data analysis and enhance research outcomes, while alleviating concerns related to complex coding scripts. The source code and a detailed tutorial for MLme are available at https://github.com/FunctionalUrology/MLme.
Title: Machine Learning Made Easy (MLme): a comprehensive toolkit for machine learning-driven data analysis. Journal: GigaScience, volume 13 (published 2024-01-02). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10783149/pdf/
Pub Date: 2024-01-02; DOI: 10.1093/gigascience/giae042
Chao Zhang, Lin Liu, Ying Zhang, Mei Li, Shuangsang Fang, Qiang Kang, Ao Chen, Xun Xu, Yong Zhang, Yuxiang Li
Background: Integrative analysis of spatially resolved transcriptomics datasets empowers a deeper understanding of complex biological systems. However, integrating multiple tissue sections presents challenges for batch effect removal, particularly when the sections are measured by various technologies or collected at different times.
Findings: We propose spatiAlign, an unsupervised contrastive learning model that employs the expression of all measured genes and the spatial location of cells, to integrate multiple tissue sections. It enables the joint downstream analysis of multiple datasets not only in low-dimensional embeddings but also in the reconstructed full expression space.
Conclusions: In benchmarking analysis, spatiAlign outperforms state-of-the-art methods in learning joint and discriminative representations for tissue sections, each potentially characterized by complex batch effects or distinct biological characteristics. Furthermore, we demonstrate the benefits of spatiAlign for the integrative analysis of time-series brain sections, including spatial clustering, differential expression analysis, and particularly trajectory inference that requires a corrected gene expression matrix.
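For readers unfamiliar with contrastive learning, the sketch below shows one widely used contrastive objective, the InfoNCE loss, on toy embeddings. It is offered only as a general illustration: the abstract does not state which loss spatiAlign uses, and the function names, temperature value, and vectors here are assumptions made for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: low when the anchor is closer to its positive
    than to any of the negatives."""
    positive_term = math.exp(cosine_similarity(anchor, positive) / temperature)
    negative_terms = sum(
        math.exp(cosine_similarity(anchor, negative) / temperature)
        for negative in negatives
    )
    return -math.log(positive_term / (positive_term + negative_terms))


# Hypothetical 2D cell embeddings.
anchor = [1.0, 0.0]
similar_cell = [0.9, 0.1]
unrelated_cells = [[0.0, 1.0], [-1.0, 0.0]]

well_aligned = info_nce(anchor, similar_cell, unrelated_cells)
badly_aligned = info_nce(anchor, [0.0, 1.0], [similar_cell, [-1.0, 0.0]])
print(well_aligned, badly_aligned)
```

Minimizing a loss of this kind pulls embeddings of matching cells together and pushes non-matching ones apart, which is the general mechanism by which contrastive models can learn representations without labels.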
Title: spatiAlign: an unsupervised contrastive learning model for data integration of spatially resolved transcriptomics. Journal: GigaScience, volume 13 (published 2024-01-02). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11258913/pdf/
Background: In the face of a growing disparity between high-throughput sequence data and low-throughput experimental studies, the emerging field of deep learning stands as a promising alternative. Generally, many data-driven approaches are capable of facilitating fast and accurate predictions of protein functions. Nevertheless, the inherent statistical nature of deep learning techniques may limit their generalization capabilities when applied to novel nonhomologous proteins that diverge significantly from existing ones.
Results: In this work, we herein propose a novel, generalized approach named Graph Adversarial Learning with Alignment (GALA) for protein function prediction. Our GALA method integrates a graph transformer architecture with an attention pooling module to extract embeddings from both protein sequences and structures, facilitating unified learning of protein representations. Particularly noteworthy, GALA incorporates a domain discriminator conditioned on both learnable representations and predicted probabilities, which undergoes adversarial learning to ensure representation invariance across diverse environments. To optimize the model with abundant label information, we generate label embeddings in the hidden space, explicitly aligning them with protein representations. Benchmarked on datasets derived from the PDB database and Swiss-Prot database, our GALA achieves considerable performance comparable to several state-of-the-art methods. Even more, GALA demonstrates wonderful biological interpretability by identifying significant functional residues associated with Gene Ontology terms through class activation mapping.
Conclusions: GALA, which leverages adversarial learning and label embedding alignment to acquire domain-invariant protein representations, exhibits outstanding generalizability in function prediction for proteins from previously unseen sequence space. By incorporating the structures predicted by AlphaFold2, GALA demonstrates significant potential for function annotation in newly discovered sequences. A detailed implementation of GALA is available at https://github.com/fuyw-aisw/GALA.
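The attention pooling step mentioned in the Results can be illustrated with a small NumPy sketch. This is not the GALA implementation (see the repository above for that). It only shows the general idea of collapsing per-residue embeddings into one protein-level vector with softmax weights. The function name `attention_pool`, the scoring vector `w`, and the toy dimensions are assumptions made for illustration.

```python
import numpy as np

def attention_pool(node_embeddings, score_vector):
    """Pool per-node embeddings (n_nodes x dim) into a single vector (dim,)."""
    scores = node_embeddings @ score_vector           # one scalar score per node
    scores = scores - scores.max()                    # stabilise the softmax
    weights = np.exp(scores) / np.exp(scores).sum()   # attention weights, sum to 1
    return weights @ node_embeddings, weights

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))   # toy input: 5 residues, 8-dimensional embeddings
w = rng.normal(size=8)        # hypothetical learnable scoring vector
pooled, weights = attention_pool(H, w)
```

With an all-zero scoring vector every residue receives the same weight, so the pooled vector reduces to the plain mean of the embeddings. A trained scoring vector instead lets some residues contribute more than others.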
Citation: Yiwei Fu, Zhonghui Gu, Xiao Luo, Qirui Guo, Luhua Lai, Minghua Deng. Learning a generalized graph transformer for protein function prediction in dissimilar sequences. GigaScience, volume 13, published 2024-01-02. DOI: 10.1093/gigascience/giae093. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11734293/pdf/
Pub Date : 2024-01-02 DOI: 10.1093/gigascience/giae095
Jen-Hung Wang, Jorge Pereda, Ching-Wen Du, Chia-Yu Chu, Maria Oberländer Christensen, Sanja Kezic, Ivone Jakasa, Jacob P Thyssen, Sreeja Satheesh, Edwin En-Te Hwu
Background: Corneocyte surface nanoscale topography (nanotexture) has recently emerged as a potential biomarker for inflammatory skin diseases, such as atopic dermatitis (AD). This assessment method involves quantifying circular nano-size objects (CNOs) in corneocyte nanotexture images, enabling noninvasive analysis via stratum corneum (SC) tape stripping. Current approaches for identifying CNOs rely on computer vision techniques with specific geometric criteria, resulting in inaccuracies due to the susceptibility of nano-imaging techniques to environmental noise and structural occlusion on the corneocyte.
Results: This study recruited 45 AD patients and 15 healthy controls, evenly divided into 4 severity groups based on their Eczema Area and Severity Index scores. Subsequently, we collected a dataset of over 1,000 corneocyte nanotexture images using our in-house high-speed dermal atomic force microscope. This dataset was utilized to train state-of-the-art deep learning object detectors for identifying CNOs. Additionally, we implemented a kernel density estimator to analyze the spatial distribution of CNOs, excluding ineffective regions with minimal CNO occurrence, such as ridges and occlusions, thereby enhancing accuracy in density calculations. After fine-tuning, our detection model achieved an overall accuracy of 91.4% in detecting CNOs.
Conclusions: By integrating a deep learning object detector with spatial analysis algorithms, we developed a precise methodology for calculating CNO density, termed the Effective Corneocyte Topographical Index (ECTI). The ECTI demonstrated exceptional robustness to nano-imaging artifacts and presents substantial potential for advancing AD diagnostics by effectively distinguishing between SC samples of varying AD severity and healthy controls.
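The idea of excluding low-occurrence regions before computing a density can be sketched as follows. This is a minimal illustration, not the published ECTI procedure. It evaluates an unnormalised Gaussian kernel density estimate on a grid, keeps only the grid cells above a relative threshold, and divides the object count by the remaining area. The function name `effective_density`, the bandwidth, the grid step, and the 10% threshold are all assumptions chosen for the example.

```python
import numpy as np

def effective_density(points, width, height, bandwidth=0.5,
                      grid_step=0.25, rel_threshold=0.1):
    """Return (objects per unit of effective area, mask of effective grid cells)."""
    xs = np.arange(grid_step / 2, width, grid_step)    # grid cell centres along x
    ys = np.arange(grid_step / 2, height, grid_step)   # grid cell centres along y
    gx, gy = np.meshgrid(xs, ys)
    grid = np.stack([gx.ravel(), gy.ravel()], axis=1)  # (cells, 2)
    sq_dist = ((grid[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    # Unnormalised Gaussian KDE: only relative values matter for the threshold.
    density = np.exp(-sq_dist / (2 * bandwidth ** 2)).sum(axis=1)
    mask = density >= rel_threshold * density.max()    # cells treated as effective
    effective_area = mask.sum() * grid_step ** 2
    return len(points) / effective_area, mask.reshape(gx.shape)

rng = np.random.default_rng(0)
# Toy image of 10 x 10 units where detections only occur in the left half,
# mimicking a right half hidden by a ridge or occlusion.
points = np.column_stack([rng.uniform(0, 5, 200), rng.uniform(0, 10, 200)])
naive = len(points) / (10 * 10)
effective, mask = effective_density(points, width=10, height=10)
```

In this toy case the naive density is diluted by the empty half of the image, while the masked estimate is computed only over the area where detections actually occur. For simplicity the sketch counts every detection, including any that fall just outside the mask.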
Citation: Stratum corneum nanotexture feature detection using deep learning and spatial analysis: a noninvasive tool for skin barrier assessment. GigaScience, volume 13, published 2024-01-02. DOI: 10.1093/gigascience/giae095. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11629979/pdf/
Pub Date : 2024-01-02 DOI: 10.1093/gigascience/giae078
Xing Liu, Chi Qu, Chuandong Liu, Na Zhu, Huaqiang Huang, Fei Teng, Caili Huang, Bingying Luo, Xuanzhu Liu, Min Xie, Feng Xi, Mei Li, Liang Wu, Yuxiang Li, Ao Chen, Xun Xu, Sha Liao, Jiajun Zhang
Background: Spatial transcriptome (ST) technologies are emerging as powerful tools for studying tumor biology. However, existing tools for analyzing ST data are limited, as they mainly rely on algorithms developed for single-cell RNA sequencing data and do not fully utilize the spatial information. While some algorithms have been developed for ST data, they are often designed for specific tasks, lacking a comprehensive analytical framework for leveraging spatial information.
Results: In this study, we present StereoSiTE, an analytical framework that combines open-source bioinformatics tools with custom algorithms to accurately infer the functional spatial cell interaction intensity (SCII) within the cellular neighborhood (CN) of interest. We applied StereoSiTE to decode ST datasets from xenograft models and found that the CN efficiently distinguished different cellular contexts, while the SCII analysis provided more precise insights into intercellular interactions by incorporating spatial information. By applying StereoSiTE to multiple samples, we successfully identified a CN region dominated by neutrophils, suggesting their potential role in remodeling the immune tumor microenvironment (iTME) after treatment. Moreover, the SCII analysis within the CN region revealed neutrophil-mediated communication, supported by pathway enrichment, transcription factor regulon activities, and protein-protein interactions.
Conclusions: StereoSiTE represents a promising framework for unraveling the mechanisms underlying treatment response within the iTME by leveraging CN-based tissue domain identification and SCII-inferred spatial intercellular interactions. The software is designed to be scalable, modular, and user-friendly, making it accessible to a wide range of researchers.
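To make the notion of a cellular neighborhood concrete, the sketch below counts, for every cell, how many cells of each type lie within a fixed radius. It is an illustrative assumption about how such a neighborhood profile can be built, not StereoSiTE's actual algorithm, and the abstract does not specify how SCII is computed. The function name `neighborhood_composition`, the radius, and the toy cell-type codes are hypothetical.

```python
import numpy as np

def neighborhood_composition(coords, labels, n_types, radius):
    """For every cell, count neighbours of each type within `radius` (self excluded)."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))            # pairwise distances
    adjacent = (dist <= radius) & ~np.eye(len(coords), dtype=bool)
    one_hot = np.eye(n_types, dtype=int)[labels]       # (cells, types)
    return adjacent.astype(int) @ one_hot              # (cells, types)

coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
labels = np.array([0, 1, 1, 0])   # toy codes, e.g. 0 = tumor cell, 1 = neutrophil
counts = neighborhood_composition(coords, labels, n_types=2, radius=1.5)
```

Each row of `counts` is a composition vector for one cell. Rows with similar composition could then be grouped, for example by clustering, to label tissue regions such as a neutrophil-dominated neighborhood.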
Citation: StereoSiTE: a framework to spatially and quantitatively profile the cellular neighborhood organized iTME. GigaScience, volume 13, published 2024-01-02. DOI: 10.1093/gigascience/giae078. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11503478/pdf/