
bioRxiv - Bioinformatics: Latest Articles

RGAST: Relational Graph Attention Network for Spatial Transcriptome Analysis
Pub Date : 2024-08-10 DOI: 10.1101/2024.08.09.607420
Yuqiao Gong, Zhangsheng Yu
Recent advancements in spatially resolved transcriptomics have provided a powerful means to comprehensively capture gene expression patterns while preserving the spatial context of the tissue microenvironment. Accurately deciphering the spatial context of spots within a tissue necessitates the careful utilization of their spatial information, which in turn requires feature extraction from complex and detailed spatial patterns. In this study, we present RGAST (Relational Graph Attention network for Spatial Transcriptome analysis), a framework designed to learn low-dimensional representations of spatial transcriptome (ST) data. RGAST is the first to consider gene expression similarity and spatial neighbor relationships simultaneously in constructing a heterogeneous graph network in ST analysis. We further introduce a cross-attention mechanism to provide a more comprehensive and adaptive representation of spatial transcriptome data. We validate the effectiveness of RGAST in different downstream tasks using diverse spatial transcriptomics datasets obtained from different platforms with varying spatial resolutions. Our results demonstrate that RGAST enhances spatial domain identification accuracy by approximately 10% compared to the second-best method on the 10X Visium DLPFC dataset. Furthermore, RGAST facilitates the discovery of spatially variable genes, uncovers spatially resolved cell-cell interactions, enables more precise cell trajectory inference, and reveals intricate 3D spatial patterns across multiple sections of ST data. Our RGAST method is available as a Python package on PyPI at https://pypi.org/project/RGAST, free for academic use, and the source code is openly available from our GitHub repository at https://github.com/GYQ-form/RGAST.
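To make the idea of a graph with two relation types concrete, here is a minimal sketch that links each spot to its nearest spatial neighbour and to its most similar spot by expression. It is not the RGAST implementation: the k-nearest-neighbour rule, the cosine distance, the function names, and the toy data are all assumptions chosen for illustration, since the abstract does not specify how the edges are built.

```python
import math


def knn_edges(vectors, k, distance):
    """Return directed edges (i, j) linking each item to its k nearest items."""
    edges = set()
    for i, vi in enumerate(vectors):
        ranked = sorted(
            (j for j in range(len(vectors)) if j != i),
            key=lambda j: (distance(vi, vectors[j]), j),  # index breaks ties
        )
        edges.update((i, j) for j in ranked[:k])
    return edges


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def build_relational_graph(coords, expression, k_spatial=1, k_expr=1):
    """Assumed construction: one edge set per relation type."""
    return {
        "spatial": knn_edges(coords, k_spatial, math.dist),
        "expression": knn_edges(expression, k_expr, cosine_distance),
    }


# Hypothetical toy data: four spots on a line, two genes each.
coords = [(0, 0), (1, 0), (5, 0), (6, 0)]
expression = [(10, 0), (0, 10), (9, 1), (1, 9)]
graph = build_relational_graph(coords, expression)
```

In this toy case the spatial relation pairs physically adjacent spots (0 with 1, 2 with 3), while the expression relation pairs distant spots with similar profiles (0 with 2, 1 with 3), which is the kind of complementary information a relational graph network could attend over.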
Citations: 0
Bayesian Estimation of Allele-Specific Expression in the Presence of Phasing Uncertainty
Pub Date : 2024-08-10 DOI: 10.1101/2024.08.09.607371
Xue Zou, Zachary W. Gomez, Timothy E. Reddy, Andrew S. Allen, William H. Majoros
Motivation: Allele specific expression (ASE) analyses aim to detect imbalanced expression of maternal versus paternal copies of an autosomal gene. Such allelic imbalance can result from a variety of cis-acting causes, including disruptive mutations within one copy of a gene that impact the stability of transcripts, as well as regulatory variants outside the gene that impact transcription initiation. Current methods for ASE estimation suffer from a number of shortcomings, such as relying on only one variant within a gene, assuming perfect phasing information across multiple variants within a gene, or failing to account for alignment biases and possible genotyping errors. Results: We developed BEASTIE, a Bayesian hierarchical model designed for precise ASE quantification at the gene level, based on given genotypes and RNA-seq data. BEASTIE addresses the complexities of allelic mapping bias, genotyping error, and phasing errors by incorporating empirical phasing error rates derived from Genome-in-a-Bottle individual NA12878. BEASTIE surpasses existing methods in accuracy, especially in scenarios with high phasing errors. This improvement is critical for identifying rare genetic variants often obscured by such errors. Through rigorous validation on simulated data and application to real data from the 1000 Genomes Project, we establish the robustness of BEASTIE. These findings underscore the value of BEASTIE in revealing patterns of ASE across gene sets and pathways.
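The cost of assuming perfect phasing can be illustrated with a small calculation. The sketch below is not the BEASTIE model: it uses a plain Beta-binomial posterior mean with a uniform prior, and the read counts and function names are hypothetical. It only shows how a single phase-switched site pulls a pooled gene-level estimate toward balance.

```python
def posterior_mean_fraction(hap1_reads, hap2_reads, prior_a=1.0, prior_b=1.0):
    """Posterior mean of the haplotype-1 expression fraction under a Beta prior."""
    return (prior_a + hap1_reads) / (prior_a + prior_b + hap1_reads + hap2_reads)


def pool_sites(sites, flipped=()):
    """Sum per-site (hap1, hap2) read counts across a gene.

    Indices listed in `flipped` are treated as phasing errors: their two
    counts are assigned to the wrong haplotypes before pooling.
    """
    hap1 = hap2 = 0
    for index, (a, b) in enumerate(sites):
        if index in flipped:
            a, b = b, a
        hap1 += a
        hap2 += b
    return hap1, hap2


# Hypothetical read counts at three heterozygous sites of one gene.
sites = [(30, 10), (28, 12), (32, 8)]
correct = posterior_mean_fraction(*pool_sites(sites))
with_error = posterior_mean_fraction(*pool_sites(sites, flipped={2}))
```

With correct phasing the pooled estimate is 91/122 (about 0.75); with the third site flipped it drops to 67/122 (about 0.55), so a strong imbalance looks nearly balanced. A model that accounts for phasing error rates, as the abstract describes, aims to avoid this dilution.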
Citations: 0
Path-based reasoning in biomedical knowledge graphs with BioPathNet
Pub Date : 2024-08-10 DOI: 10.1101/2024.06.17.599219
Yue Hu, Svitlana Oleshko, Samuele Firmani, Zhaocheng Zhu, Hui Cheng, Maria Ulmer, Matthias Arnold, Maria Colome-Tatche, Jian Tang, Sophie Xhonneux, Annalisa Marsico
Understanding complex interactions in biomedical networks is crucial for advancements in biomedicine, but traditional link prediction (LP) methods are limited in capturing this complexity. Representation-based learning techniques improve prediction accuracy by mapping nodes to low-dimensional embeddings, yet they often struggle with interpretability and scalability. We present BioPathNet, a novel graph neural network framework based on the Neural Bellman-Ford Network (NBFNet), addressing these limitations through path-based reasoning for LP in biomedical knowledge graphs. Unlike node-embedding frameworks, BioPathNet learns representations between node pairs by considering all relations along paths, enhancing prediction accuracy and interpretability. This allows visualization of influential paths and facilitates biological validation. BioPathNet leverages a background regulatory graph (BRG) for enhanced message passing and uses stringent negative sampling to improve precision. In evaluations across various LP tasks, such as gene function annotation, drug-disease indication, synthetic lethality, and lncRNA-mRNA interaction prediction, BioPathNet consistently outperformed shallow node embedding methods, relational graph neural networks and task-specific state-of-the-art methods, demonstrating robust performance and versatility. Our study predicts novel drug indications for diseases like acute lymphoblastic leukemia (ALL) and Alzheimer's, validated by medical experts and clinical trials. We also identified new synthetic lethality gene pairs and regulatory interactions involving lncRNAs and target genes, confirmed through literature reviews. BioPathNet's interpretability will enable researchers to trace prediction paths and gain molecular insights, making it a valuable tool for drug discovery, personalized medicine and biology in general.
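Path-based scoring of a node pair can be sketched as a Bellman-Ford style recursion over typed edges. The version below is a simplified stand-in, not BioPathNet or NBFNet code: relation embeddings are replaced by scalar weights, the learned message and aggregation functions are replaced by multiplication and summation, and the graph, relation names, and weights are hypothetical.

```python
def bellman_ford_scores(edges, relation_weight, source, num_nodes, num_steps):
    """Score every node relative to `source` by summing weighted paths.

    edges: list of (head, relation, tail) triples.
    relation_weight: dict mapping a relation name to a scalar weight
    (an assumed stand-in for a learned relation representation).
    After `num_steps` iterations, scores[v] sums the products of relation
    weights over all paths from `source` to v of length <= num_steps.
    """
    boundary = [1.0 if v == source else 0.0 for v in range(num_nodes)]
    scores = boundary[:]
    for _ in range(num_steps):
        updated = boundary[:]  # re-inject the boundary condition each step
        for head, relation, tail in edges:
            updated[tail] += scores[head] * relation_weight[relation]
        scores = updated
    return scores


# Hypothetical graph: node 0 = drug, node 1 = gene, node 2 = disease.
edges = [(0, "targets", 1), (1, "associated_with", 2), (0, "treats", 2)]
weights = {"targets": 0.5, "associated_with": 0.4, "treats": 0.1}
scores = bellman_ford_scores(edges, weights, source=0, num_nodes=3, num_steps=2)
```

Here the drug-disease score combines the direct edge (0.1) with the two-hop path through the gene (0.5 × 0.4 = 0.2). Because each contribution corresponds to an explicit path, the paths that dominate a prediction can be listed, which is the interpretability benefit the abstract refers to.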
Citations: 0
Novel Insights into Post-Myocardial Infarction Cardiac Remodeling through Algorithmic Detection of Cell-Type Composition Shifts
Pub Date : 2024-08-10 DOI: 10.1101/2024.08.09.607400
Brian Gural, Logan Kirkland, Abbey Hockett, Peyton Sandroni, Jiandong Zhang, Manuel Rosa-Garrido, Samantha K Swift, Douglas Chapski, Michael A Flinn, Caitlin C O'Meara, Thomas M Vondriska, Michaela Patterson, Brian C Jensen, Christoph Rau
Background: Recent advances in single cell sequencing have led to an increased focus on the role of cell-type composition in phenotypic presentation and disease progression. Cell-type composition research in the heart is challenging due to large, frequently multinucleated cardiomyocytes that preclude most single cell approaches from obtaining accurate measurements of cell composition. Our in silico studies reveal that ignoring cell type composition when calculating differentially expressed genes (DEGs) can have significant consequences. For example, a relatively small change in cell abundance of only 10% can result in over 25% of DEGs being false positives. Methods: We have implemented an algorithmic approach that uses snRNAseq datasets as a reference to accurately calculate cell type compositions from bulk RNAseq datasets through robust data cleaning, gene selection, and multi-sample cross-subject and cross-cell-type deconvolution. We applied our approach to cardiomyocyte-specific α1A adrenergic receptor (CM-α1A-AR) knockout mice. 8-12 week-old mice (either WT or CM-α1A-KO) were subjected to permanent left coronary artery (LCA) ligation or sham surgery (n=4 per group). Transcriptomes from the infarct border zones were collected 3 days later and analyzed using our algorithm to determine cell-type abundances; we then corrected differential expression calculations using DESeq2 and validated these findings using RNAscope. Results: Uncorrected DEGs for the CM-α1A-KO X LCA interaction term featured many cell-type specific genes such as Timp4 (fibroblasts) and Aplnr (cardiomyocytes) and overall GO enrichment for terms pertaining to cardiomyocyte differentiation (P=3.1E-4). Using our algorithm, we observe a striking loss of cardiomyocytes and gain in fibroblasts in the α1A-KO + LCA mice that was not recapitulated in WT + LCA animals, although we did observe a similar increase in macrophage abundance in both conditions. This recapitulates prior results that showed a much more severe heart failure phenotype in CM-α1A-KO + LCA mice. Following correction for cell-type, our DEGs now highlight a novel set of genes enriched for GO terms such as cardiac contraction (P=3.7E-5) and actin filament organization (P=6.3E-5). Conclusions: Our algorithm identifies and corrects for cell-type abundance in bulk RNAseq datasets, opening new avenues for research on novel genes and pathways as well as an improved understanding of the role of cardiac cell types in cardiovascular disease.
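Why a composition shift alone can look like differential expression, and how a reference signature lets it be estimated, can be shown with a two-cell-type toy model. This is a sketch under strong assumptions and not the authors' algorithm: bulk expression is modelled as a linear mixture, the signatures and gene labels are hypothetical, and the fraction is recovered by closed-form least squares rather than multi-sample deconvolution.

```python
def mix(fraction_cm, sig_cm, sig_fb):
    """Bulk expression as a linear mixture of two cell-type signatures."""
    return [fraction_cm * c + (1.0 - fraction_cm) * f
            for c, f in zip(sig_cm, sig_fb)]


def estimate_fraction(bulk, sig_cm, sig_fb):
    """Least-squares estimate of the first cell type's fraction, clipped to [0, 1].

    Solves min_f sum_g (bulk_g - f*sig_cm_g - (1-f)*sig_fb_g)^2 in closed form.
    """
    diff = [c - f for c, f in zip(sig_cm, sig_fb)]
    numerator = sum((b - f) * d for b, f, d in zip(bulk, sig_fb, diff))
    denominator = sum(d * d for d in diff)
    return min(1.0, max(0.0, numerator / denominator))


# Hypothetical per-cell-type expression for three genes:
# a cardiomyocyte marker, a fibroblast marker, and a housekeeping gene.
sig_cm = [100.0, 2.0, 50.0]
sig_fb = [1.0, 80.0, 50.0]

sham = mix(0.7, sig_cm, sig_fb)     # 70% cardiomyocytes
injured = mix(0.6, sig_cm, sig_fb)  # 60% cardiomyocytes, same per-cell expression

fold_change_fb_marker = injured[1] / sham[1]
estimated_sham = estimate_fraction(sham, sig_cm, sig_fb)
estimated_injured = estimate_fraction(injured, sig_cm, sig_fb)
```

In this toy setup the fibroblast marker rises by roughly 30% in bulk (33.2 versus 25.4) even though no cell changed its expression, which is how uncorrected analyses can report cell-type markers as DEGs. Once the fractions are estimated, they can be supplied as covariates to a differential expression model.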
Citations: 0
Predict metal-binding proteins and structures through integration of evolutionary-scale and physics-based modeling
Pub Date : 2024-08-10 DOI: 10.1101/2024.08.09.607368
Xin Dai, Max Henderson, Shinjae Yoo, Qun Liu
Metals are essential elements in all living organisms, binding to approximately 50% of proteins. They serve to stabilize proteins, catalyze reactions, regulate activities, and fulfill various physiological and pathological functions. While there have been many advancements in determining the structures of protein-metal complexes, numerous metal-binding proteins still need to be identified through computational methods and validated through experiments. To address this need, we have developed the ESMBind-based workflow, which combines evolutionary scale modeling (ESM) for metal-binding prediction and physics-based protein-metal modeling. Our approach utilizes the ESM-2 and ESM-IF models to predict metal-binding probability at the residue level. In addition, we have designed a metal-placement method and energy minimization technique to generate detailed 3D structures of protein-metal complexes. Our workflow outperforms other models in terms of residue and 3D-level predictions. To demonstrate its effectiveness, we applied the workflow to 142 uncharacterized fungal pathogen proteins and predicted metal-binding proteins involved in fungal infection and virulence.
Citations: 0
MyVivarium: A cloud-based lab animal colony management application with near-realtime ambient sensing
Pub Date : 2024-08-10 DOI: 10.1101/2024.08.10.607395
Robinson Vidva, Mir Abbas Raza, Jaswant Prabhakaran, Ayesha Sheikh, Alaina Sharp, Hayden Ott, Amelia Moore, Christopher Fleisher, Pothitos M. Pitychoutis, Tam V. Nguyen, Aaron Sathyanesan
Research-animal colony management is a critical determinant of scientific productivity in labs conducting preclinical research. However, labs generally use ad hoc paper-based or spreadsheet-based methods to manage animal colonies. Current software-based solutions are limited based on cost, ease of implementation and deployment, and lack of remote access. In addition, most current solutions lack integration of realtime monitoring of ambient variables that affect colony wellbeing and breeding efficiency. Keeping these functionalities in mind, we built MyVivarium - a cloud-based animal colony database management web application that can be easily deployed and sustained for a cost comparable to starting and maintaining a lab website. For mouse colony management, MyVivarium allows for the tracking of individual mice or litters within holding-cage and breeding-cage records by multiple users at designated levels of access. Physical identities of cages are mapped onto the database using printable cage-card templates with cage-specific QR codes, enabling quick record updating using mobile devices. Tasks can be easily assigned with reminders for experiments or cage maintenance. Finally, MyVivarium integrates near-realtime internet-of-things (IoT)-based ambient sensing using low-cost open-source hardware to track humidity, temperature, vivarium-worker activity, and room illuminance. Taken together, MyVivarium is a novel open-source cloud-based application template that can serve as a low-cost, simple, and efficient solution for digital management of research animal colonies.
Citations: 0
Codon Usage Bias Analysis of Human Papillomavirus 18s L1 Protein and its Host Adaptability
Pub Date : 2024-08-10 DOI: 10.1101/2024.08.10.607454
Vinaya Vinod Shinde, Swati Bankariya, Parminder Kaur
Human Papillomavirus 18 (HPV 18) is known as a high-risk variant associated with cervical and anogenital malignancies. High-risk types HPV 18 and HPV 16 (human papillomavirus 16) play a major part in about 70 percent of cervical cancer cases worldwide (Ramakrishnan et al., 2015). The L1 protein of HPV 18 (HPV 18s L1 protein), also known as the major capsid L1 protein, is targeted in vaccine development against HPV 18 due to its non-oncogenic and non-infectious properties and its ability to self-assemble into virus-like particles. In the present analysis, an extensive codon usage bias analysis of HPV 18s L1 protein and its adaptation to its human host was conducted. The Effective Number of Codons (Nc), Grand Average of Hydropathy (GRAVY), Index of Aromaticity (AROMO), and Codon Bias Index (CBI) values revealed no biases in codon usage of HPV 18s L1 protein. The Codon Adaptation Index (CAI) and Relative Codon Deoptimization Index (RCDI) data indicate adaptation of HPV 18s L1 protein to its human host. The domination of selection pressure on codon usage of HPV 18s L1 protein was demonstrated based on GC12 vs GC3, Nc vs GC3, and frequency of optimal codons (FOP). The parity plot revealed that the genome of HPV 18s L1 protein has a preference for purine over pyrimidine, that is, G nucleotides over C, and no preference for A over T, but A/T richness was observed in the genome of HPV 18s L1 protein. In the nucleotide composition, GC1 richness ultimately represents evolutionary aspects of codon usage. Furthermore, these findings can be used in currently ongoing vaccine development and gene therapy to design viral vectors.
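Several of the quantities named above have simple definitions that can be computed directly from a coding sequence. The sketch below computes GC content by codon position (GC1, GC2, GC3, and GC12 as the mean of GC1 and GC2) and relative synonymous codon usage (RSCU, the observed count of a codon divided by the mean count across its synonymous family). The function names and the toy sequence are assumptions for illustration; the sequence is not the HPV 18 L1 gene, and this is not the authors' pipeline.

```python
from collections import Counter


def codons(seq):
    """Split a coding sequence into complete codons, ignoring a trailing remainder."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def gc_by_position(seq):
    """GC fraction at each codon position, plus GC12 (mean of GC1 and GC2)."""
    cds = codons(seq)
    gc = [sum(c[pos] in "GC" for c in cds) / len(cds) for pos in range(3)]
    return {"GC1": gc[0], "GC2": gc[1], "GC3": gc[2], "GC12": (gc[0] + gc[1]) / 2}


def rscu(seq, family):
    """RSCU for each codon in one synonymous family (codons for one amino acid)."""
    counts = Counter(codons(seq))
    total = sum(counts[c] for c in family)
    if total == 0:
        return {c: 0.0 for c in family}
    expected = total / len(family)  # count per codon if usage were uniform
    return {c: counts[c] / expected for c in family}


# Hypothetical toy sequence: Ala, Ala, Ala, Ala, Lys, Lys.
toy_cds = "GCTGCCGCAGCTAAAAAG"
alanine = ["GCT", "GCC", "GCA", "GCG"]

gc = gc_by_position(toy_cds)
ala_rscu = rscu(toy_cds, alanine)
```

An RSCU above 1 means a codon is used more often than expected under uniform synonymous usage (GCT here), and below 1 means less often (GCG here). Plotting GC12 against GC3 across genes gives the neutrality plot mentioned in the abstract, where a weak correlation is usually read as a sign that selection rather than mutation pressure dominates.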
Citations: 0
Impacts of Cell Ranger versions on Chromium gene expression data
Pub Date : 2024-08-10 DOI: 10.1101/2024.08.10.607413
Imad Abugessaisa, Akira Hasegawa, Scott Walker, Shintaro Katayama, Juha Kere, Takeya Kasukawa
For droplet-based Chromium single-cell gene expression data from the 10x Genomics platform, Cell Ranger (CR) is the standard pipeline for cell barcode calling. However, no systematic evaluation of the impact of the released versions of CR on Chromium single-cell gene expression data has been conducted. To comprehensively evaluate the impact of CR, we considered six molecular quality criteria, quantified gene expression, and performed downstream analysis for 12 single-cell Chromium gene expression datasets. Each dataset was processed by 10 versions of CR, resulting in 180 datasets and a total of 702,493 cell barcodes. We demonstrated that, for the same dataset, different versions of CR yield different numbers of cell barcodes, with significant variation in molecular qualities and average gene expression. Our analysis distinguishes two categories of cell barcodes: common barcodes, called (unmasked) by all versions of CR, and specific barcodes, called (unmasked/masked) by only some versions. Surprisingly, we observed variations in molecular quality indices among common cell barcodes when they were called by different versions of CR. The specific barcodes yield skewed gene body coverage and form distinct clusters at the edges of UMAP plots. The choice of CR version affects quality scores, average gene expression, clustering results, and top cluster marker genes for each dataset. Our study indicates a demonstrable, quantitative effect of the choice of CR version on downstream analysis, resulting in widely different Chromium single-cell gene expression data for different CR versions.
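The split into common and specific barcodes described above amounts to a set intersection and a set difference. The following Python is a minimal sketch under the assumption that the barcodes called by each CR version are available as sets of strings; the version labels and barcodes are invented placeholders, not values from the study.

```python
def split_barcodes(calls: dict[str, set[str]]) -> tuple[set[str], set[str]]:
    """Split barcodes into those called by every version and those called by only some.

    `calls` maps a (hypothetical) version label to the set of barcodes it called.
    """
    if not calls:
        raise ValueError("need calls from at least one version")
    per_version = list(calls.values())
    common = set.intersection(*per_version)         # called by all versions
    specific = set.union(*per_version) - common     # called by some, not all
    return common, specific


# Placeholder data: three hypothetical versions of the pipeline
calls = {
    "version_a": {"AAAC", "AAAG", "AACT"},
    "version_b": {"AAAC", "AAAG", "AAGG"},
    "version_c": {"AAAC", "AAAG"},
}
common, specific = split_barcodes(calls)
print(sorted(common), sorted(specific))
```

Keeping the two groups as separate sets makes it straightforward to compare quality metrics within each group per version, which is the kind of comparison the abstract reports.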
Citations: 0
Fine-tuning of conditional Transformers for the generation of functionally characterized enzymes
Pub Date : 2024-08-10 DOI: 10.1101/2024.08.10.607430
Marco Nicolini, Emanuele Saitto, Ruben E Jimenez Franco, Emanuele Cavalleri, Marco Mesiti, Aldo J Galeano Alfonso, Dario Malchiodi, Alberto Paccanaro, Peter N Robinson, Elena Casiraghi, Giorgio Valentini
We introduce Finenzyme, a Protein Language Model (PLM) that employs a multifaceted learning strategy based on transfer learning from a decoder-based Transformer, conditional learning using specific functional keywords, and fine-tuning to model specific Enzyme Commission (EC) categories. Using Finenzyme, we investigate the conditions under which fine-tuning enhances the prediction and generation of EC categories, showing a two-fold perplexity improvement in EC-specific categories compared to a generalist model. Our extensive experimentation shows that Finenzyme-generated sequences can be very different from natural ones while retaining the tertiary structures, functions, and chemical kinetics of their natural counterparts. Importantly, the embedded representations of the generated enzymes closely resemble those of natural ones, making them suitable for downstream tasks. Finally, we illustrate how Finenzyme can be used in practice to generate enzymes characterized by specific functions using in-silico directed evolution, a computationally inexpensive PLM fine-tuning procedure that significantly enhances and assists targeted enzyme engineering tasks.
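The perplexity figure quoted above is the standard language-model metric: the exponential of the mean negative log-likelihood per token. The Python below is a minimal sketch of that definition only; the per-residue probabilities are made-up numbers, not Finenzyme outputs, and reading "two-fold improvement" as a halving of perplexity is an interpretation rather than something the abstract states explicitly.

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """Perplexity from natural-log probabilities the model assigned to each observed token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Made-up example: a model that spreads probability uniformly over 20 amino acids
generalist = perplexity([math.log(1 / 20)] * 50)
# Made-up example: a model that assigns probability 0.1 to every observed residue
fine_tuned = perplexity([math.log(0.1)] * 50)
print(generalist, fine_tuned)
```

Lower is better: a perplexity of 20 corresponds to guessing uniformly among the 20 standard amino acids, so values well below that indicate the model has learned sequence regularities.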
Citations: 0
Discovering nuclear localization signal universe through a novel deep learning model with interpretable attention units
Pub Date : 2024-08-10 DOI: 10.1101/2024.08.10.606103
Yifan Li, Xiaoyong Pan, Hong-Bin Shen
Nuclear localization signals (NLSs) are essential peptide fragments within proteins that play a decisive role in guiding proteins into the cell nucleus. Determining the existence and precise locations of NLSs experimentally is time-consuming and complicated, resulting in a scarcity of experimentally validated NLS fragments. Consequently, annotated NLS datasets are relatively small, presenting challenges for data-driven methods. In this study, we propose an innovative interpretable approach, NLSExplorer, which leverages large-scale protein language models to capture crucial biological information with a novel attention-based deep network for NLS identification. By utilizing the knowledge retrieved from protein language models, NLSExplorer achieves superior predictive performance compared to existing methods on two NLS benchmark datasets. Additionally, NLSExplorer is able to detect various kinds of segments highly correlated with nuclear transport, such as nuclear export signals. We employ NLSExplorer to investigate potential NLSs and other domains that are important for nuclear transport in nucleus-localized proteins within the Swiss-Prot database. Further comprehensive pattern analysis for all these segments uncovers a potential NLS space and internal relationship of important nuclear transport segments for 416 species. This study not only introduces a powerful tool for predicting and exploring NLS space, but also offers a versatile network that detects characteristic domains and motifs of NLSs.
Citations: 0