
BMC Bioinformatics: Latest Articles

Answering open questions in biology using spatial genomics and structured methods.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-09-04 DOI: 10.1186/s12859-024-05912-5
Siddhartha G Jena, Archit Verma, Barbara E Engelhardt

Genomics methods have uncovered patterns in a range of biological systems, but obscure important aspects of cell behavior: the shapes, relative locations, movement, and interactions of cells in space. Spatial technologies that collect genomic or epigenomic data while preserving spatial information have begun to overcome these limitations. These new data promise a deeper understanding of the factors that affect cellular behavior, and in particular the ability to directly test existing theories about cell state and variation in the context of morphology, location, motility, and signaling that could not be tested before. Rapid advancements in resolution, ease-of-use, and scale of spatial genomics technologies to address these questions also require an updated toolkit of statistical methods with which to interrogate these data. We present a framework to respond to this new avenue of research: four open biological questions that can now be answered using spatial genomics data paired with methods for analysis. We outline spatial data modalities for each open question that may yield specific insights, discuss how conflicting theories may be tested by comparing the data to conceptual models of biological behavior, and highlight statistical and machine learning-based tools that may prove particularly helpful to recover biological understanding.

PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11375982/pdf/
Citations: 0
VCF observer: a user-friendly software tool for preliminary VCF file analysis and comparison.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-09-03 DOI: 10.1186/s12859-024-05860-0
Abdullah Asım Emül, Mehmet Arif Ergün, Rumeysa Aslıhan Ertürk, Ömer Çinal, Mehmet Baysan

Background: Advancements over the past decade in DNA sequencing technology and computing power have created the potential to revolutionize medicine. There has been a marked increase in genetic data available, allowing for the advancement of areas such as personalized medicine. A crucial type of data in this context is genetic variant data which is stored in variant call format (VCF) files. However, the rapid growth in genomics has presented challenges in analyzing and comparing VCF files.

Results: In response to the limitations of existing tools, this paper introduces a novel web application that provides a user-friendly solution for VCF file analyses and comparisons. The software tool enables researchers and clinicians to perform high-level analysis with ease and enhances productivity. The application's interface allows users to conveniently upload, analyze, and visualize their VCF files using simple drag-and-drop and point-and-click operations. Essential visualizations such as Venn diagrams, clustergrams, and precision-recall plots are provided to users. A key feature of the application is its support for metadata-based file grouping, accomplished through flexible data matrix uploads, streamlining organization and analysis of user-defined categories. Additionally, the application facilitates standardized benchmarking of VCF files by integrating user-provided ground truth regions and variant lists.

Conclusions: By providing a user-friendly interface and supporting essential visualizations, this software enhances the accessibility of VCF file analysis and assists researchers and clinicians in their scientific inquiries.
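The comparison and benchmarking described in this abstract (Venn-style overlaps and precision-recall against a ground truth variant list) can be illustrated with a minimal sketch. This is not the VCF observer implementation: the function names, the choice of (chromosome, position, REF, ALT) as the variant key, and the toy records are all assumptions made for illustration, and ground truth regions are not handled here.

```python
from typing import Iterable, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, REF, ALT)


def variant_keys(vcf_lines: Iterable[str]) -> Set[Variant]:
    """Collect one key per ALT allele from the data lines of a VCF file."""
    keys: Set[Variant] = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines, and the column header
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):  # multi-allelic sites give several keys
            keys.add((chrom, int(pos), ref, allele))
    return keys


def precision_recall(calls: Set[Variant], truth: Set[Variant]) -> Tuple[float, float]:
    """Precision and recall of a call set against a ground truth set."""
    true_positives = len(calls & truth)
    precision = true_positives / len(calls) if calls else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Toy, made-up records (hypothetical data, not from the paper).
truth_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tC\tT\t50\tPASS\t.",
    "chr2\t300\t.\tG\tA,C\t50\tPASS\t.",
]
calls_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t42\tPASS\t.",
    "chr2\t300\t.\tG\tA\t37\tPASS\t.",
    "chr2\t400\t.\tT\tC\t12\tPASS\t.",
]

truth = variant_keys(truth_vcf)
calls = variant_keys(calls_vcf)

# The three regions of a two-set Venn diagram.
shared = calls & truth
only_calls = calls - truth
only_truth = truth - calls

precision, recall = precision_recall(calls, truth)
print(len(shared), len(only_calls), len(only_truth), precision, recall)
```

Keying on exact REF and ALT strings is the simplest choice; it treats differently normalized representations of the same indel as different variants, so real comparisons usually normalize first.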

PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11373448/pdf/
Citations: 0
Correction: Advancing drug-target interaction prediction: a comprehensive graph-based approach integrating knowledge graph embedding and ProtBert pretraining.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-09-03 DOI: 10.1186/s12859-024-05905-4
Warith Eddine Djeddi, Khalil Hermi, Sadok Ben Yahia, Gayo Diallo
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11373278/pdf/
Citations: 0
Variant graph craft (VGC): a comprehensive tool for analyzing genetic variation and identifying disease-causing variants.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-09-03 DOI: 10.1186/s12859-024-05875-7
Jennifer Li, Andy Yang, Benedito A Carneiro, Ece D Gamsiz Uzun, Lauren Massingham, Alper Uzun

Background: The variant call format (VCF) file is a structured and comprehensive text file crucial for researchers and clinicians in interpreting and understanding genomic variation data. It contains essential information about variant positions in the genome, along with alleles, genotype calls, and quality scores. Analyzing and visualizing these files, however, poses significant challenges due to the need for diverse resources and robust features for in-depth exploration.

Results: To address these challenges, we introduce variant graph craft (VGC), a VCF file visualization and analysis tool. VGC offers a wide range of features for exploring genetic variations, including extraction of variant data, intuitive visualization, and graphical representation of samples with genotype information. VGC is designed primarily for the analysis of patient cohorts, but it can also be adapted for use with individual probands or families. It integrates seamlessly with external resources, providing insights into gene function and variant frequencies in sample data. VGC includes gene function and pathway information from Molecular Signatures Database (MSigDB) for GO terms, KEGG, Biocarta, Pathway Interaction Database, and Reactome. Additionally, it dynamically links to gnomAD for variant information and incorporates ClinVar data for pathogenic variant information. VGC supports the Human Genome Assembly Hg37 and Hg38, ensuring compatibility with a wide range of data sets, and accommodates various approaches to exploring genetic variation data. It can be tailored to specific user needs with optional phenotype input data.

Conclusions: In summary, VGC provides a comprehensive set of features tailored to researchers working with genomic variation data. Its intuitive interface, rapid filtering capabilities, and the flexibility to perform queries using custom groups make it an effective tool in identifying variants potentially associated with diseases. VGC operates locally, ensuring data security and privacy by eliminating the need for cloud-based VCF uploads, making it a secure and user-friendly tool. It is freely available at https://github.com/alperuzun/VGC .
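The VCF fields named in the Background (variant position, alleles, genotype calls, quality score) can be made concrete with a short parsing sketch. This is an illustration only and not VGC code: `VcfRecord`, `parse_record`, `carriers`, the sample names, and the example record are hypothetical, and the sketch assumes a `GT` entry is present in the FORMAT column.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class VcfRecord:
    chrom: str
    pos: int
    ref: str
    alts: List[str]
    qual: Optional[float]
    genotypes: Dict[str, str]  # sample name -> GT string such as "0/1"


def parse_record(line: str, samples: List[str]) -> VcfRecord:
    """Parse one tab-separated VCF data line (assumes FORMAT contains GT)."""
    fields = line.rstrip("\n").split("\t")
    qual = None if fields[5] == "." else float(fields[5])
    genotypes: Dict[str, str] = {}
    if len(fields) > 9:  # column 9 is FORMAT, sample columns follow it
        gt_index = fields[8].split(":").index("GT")
        for name, column in zip(samples, fields[9:]):
            genotypes[name] = column.split(":")[gt_index]
    return VcfRecord(fields[0], int(fields[1]), fields[3],
                     fields[4].split(","), qual, genotypes)


def carriers(record: VcfRecord) -> List[str]:
    """Samples whose genotype includes at least one non-reference allele."""
    found = []
    for name, genotype in record.genotypes.items():
        alleles = genotype.replace("|", "/").split("/")
        if any(allele not in ("0", ".") for allele in alleles):
            found.append(name)
    return found


# Hypothetical three-sample cohort and a made-up record.
cohort = ["P1", "P2", "P3"]
example_line = "chr7\t117559590\t.\tATCT\tA\t99\tPASS\t.\tGT:DP\t0/1:30\t0/0:28\t1|1:25"
record = parse_record(example_line, cohort)
print(record.pos, record.alts, record.qual, carriers(record))
```

Grouping samples by carrier status in this way is one simple starting point for the cohort-level queries the abstract describes.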

PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11370019/pdf/
Citations: 0
Tensor product algorithms for inference of contact network from epidemiological data.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-09-02 DOI: 10.1186/s12859-024-05910-7
Sergey Dolgov, Dmitry Savostyanov

We consider the problem of inferring a contact network from nodal states observed during an epidemiological process. In a black-box Bayesian optimisation framework this problem reduces to a discrete likelihood optimisation over the set of possible networks. The cardinality of this set grows combinatorially with the number of network nodes, which makes this optimisation computationally challenging. For each network, its likelihood is the probability that the observed data appear during the evolution of the epidemiological process on this network. This probability can be very small, particularly if the network differs significantly from the ground truth network that actually generated the observed data. A commonly used stochastic simulation algorithm struggles to recover rare events and hence to estimate small probabilities and likelihoods. In this paper we replace the stochastic simulation with solving the chemical master equation for the probabilities of all network states. Since this equation also suffers from the curse of dimensionality, we apply tensor train approximations to overcome it and enable fast and accurate computations. Numerical simulations demonstrate efficient black-box Bayesian inference of the network.
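To make the likelihood idea concrete, here is a minimal brute-force sketch under stated assumptions: a susceptible-infected (SI) process with a single infection rate `beta`, states stored as bitmasks of infected nodes, and the master equation solved by uniformization over all 2^N states. The epidemic model, function names, and toy observations are illustrative assumptions, not taken from the paper, and no tensor train approximation is used. Enumerating every state is exactly what stops scaling as N grows, which is the problem the tensor train approach in the abstract targets.

```python
import math
from typing import Dict, List, Sequence, Tuple

Edge = Tuple[int, int]


def generator(n_nodes: int, edges: Sequence[Edge], beta: float) -> List[Dict[int, float]]:
    """rates[s][t] is the rate of the jump s -> t; a state is a bitmask of infected nodes."""
    neighbours: List[List[int]] = [[] for _ in range(n_nodes)]
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    rates: List[Dict[int, float]] = []
    for state in range(2 ** n_nodes):
        out: Dict[int, float] = {}
        for node in range(n_nodes):
            if (state >> node) & 1:
                continue  # node is already infected
            infected = sum(1 for m in neighbours[node] if (state >> m) & 1)
            if infected:
                out[state | (1 << node)] = beta * infected
        rates.append(out)
    return rates


def transition_prob(rates: List[Dict[int, float]], start: int, end: int,
                    t: float, tol: float = 1e-12, max_terms: int = 10000) -> float:
    """P(X_t = end | X_0 = start) from the master equation, via uniformization."""
    lam = max(sum(out.values()) for out in rates) or 1.0
    p = [0.0] * len(rates)
    p[start] = 1.0
    weight = math.exp(-lam * t)  # Poisson weight for zero jumps
    covered = weight
    total = weight * p[end]
    for k in range(1, max_terms):
        if 1.0 - covered <= tol:
            break
        nxt = [0.0] * len(p)  # one step of the chain P = I + Q / lam
        for s, mass in enumerate(p):
            if mass == 0.0:
                continue
            nxt[s] += mass * (1.0 - sum(rates[s].values()) / lam)
            for target, rate in rates[s].items():
                nxt[target] += mass * rate / lam
        p = nxt
        weight *= lam * t / k
        covered += weight
        total += weight * p[end]
    return total


def likelihood(n_nodes: int, edges: Sequence[Edge], beta: float,
               observations: Sequence[Tuple[float, int]]) -> float:
    """Product of transition probabilities between consecutive (time, state) observations."""
    rates = generator(n_nodes, edges, beta)
    lik = 1.0
    for (t0, s0), (t1, s1) in zip(observations, observations[1:]):
        lik *= transition_prob(rates, s0, s1, t1 - t0)
    return lik


# Toy data: node 0 infected at t=0, nodes 0 and 1 at t=1, all three at t=2.
observed = [(0.0, 0b001), (1.0, 0b011), (2.0, 0b111)]
chain = [(0, 1), (1, 2)]
star = [(0, 2), (1, 2)]
complete = [(0, 1), (1, 2), (0, 2)]

lik_chain = likelihood(3, chain, 1.0, observed)
lik_star = likelihood(3, star, 1.0, observed)
lik_complete = likelihood(3, complete, 1.0, observed)
print(lik_chain, lik_complete, lik_star)
```

On the star network node 1 can only be infected through node 2, so the observed sequence is impossible and the likelihood is exactly zero. A stochastic simulation would only report that no sampled run matched, which illustrates why very small likelihoods are hard to estimate by sampling.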

PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11370089/pdf/
Citations: 0
PCP-GC-LM: single-sequence-based protein contact prediction using dual graph convolutional neural network and convolutional neural network.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-09-02 DOI: 10.1186/s12859-024-05914-3
J Ouyang, Y Gao, Y Yang

Background: Recently, advances in the use of evolutionary information and in deep learning networks have improved protein contact prediction methods. Nevertheless, some bottlenecks remain. (1) Prediction is difficult for orphan proteins and other proteins with little evolutionary information. (2) Single-sequence-based prediction methods focus mainly on selecting protein sequence features and tuning the neural network architecture; deeper neural networks improve prediction accuracy but increase the computational burden. Compared with other neural networks used in protein prediction, graph neural networks have several advantages. They reveal topological structure, and their hierarchical structure and local connectivity help capture features at different levels of abstraction in protein molecules. When protein sequence and structure information are used for joint training, the dependencies between the two kinds of information can be captured better. Graph neural networks can also process protein molecular structures of different lengths and shapes, whereas traditional neural networks need to convert proteins into fixed-size vectors or matrices for processing.

Results: Here, we propose a single-sequence-based protein contact map predictor, PCP-GC-LM, which combines dual-level graph neural networks with convolutional networks. Our method performs better than other single-sequence-based predictors in different independent tests. In addition, to verify the validity of our method on complex protein structures, we also compare it with other methods on two homodimer protein test sets (the DeepHomo test dataset and the CASP-CAPRI target dataset). Furthermore, we perform ablation experiments to demonstrate the necessity of the dual graph network. Overall, our framework presents new modules for accurately predicting inter-chain contact maps in proteins, and it is also useful for analyzing interactions in other types of protein complexes.
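For readers unfamiliar with the prediction target, a contact map is a symmetric binary matrix marking residue pairs that are spatially close. The sketch below derives one from coordinates. The abstract does not state a contact definition, so the 8 Å distance cutoff and the minimum sequence separation used here are assumptions based on common convention, and the coordinates are a made-up toy hairpin rather than a real structure.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def contact_map(coords: Sequence[Point], threshold: float = 8.0,
                min_separation: int = 6) -> List[List[bool]]:
    """Mark residue pairs closer than `threshold` (in angstroms).

    Pairs closer than `min_separation` positions along the sequence are
    ignored, since chain neighbours are trivially near each other.
    """
    n = len(coords)
    cmap = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = cmap[j][i] = True
    return cmap


# Toy hairpin: four residues along a line, then four coming back 5 angstroms away.
hairpin: List[Point] = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0),
    (11.4, 5.0, 0.0), (7.6, 5.0, 0.0), (3.8, 5.0, 0.0), (0.0, 5.0, 0.0),
]

# A small separation is used because the toy chain has only eight residues.
cmap = contact_map(hairpin, threshold=8.0, min_separation=3)
n_contacts = sum(cmap[i][j] for i in range(len(hairpin)) for j in range(i + 1, len(hairpin)))
print(n_contacts)
```

A predictor such as the one described above outputs a probability for each entry of this matrix; the sketch only shows how the reference map would be built from a known structure.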

PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11370006/pdf/
Citations: 0
SmithHunter: a workflow for the identification of candidate smithRNAs and their targets.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-09-02 DOI: 10.1186/s12859-024-05909-0
Giovanni Marturano, Diego Carli, Claudio Cucini, Antonio Carapelli, Federico Plazzi, Francesco Frati, Marco Passamonti, Francesco Nardi

Background: SmithRNAs (Small MITochondrial Highly-transcribed RNAs) are a novel class of small RNA molecules that are encoded in the mitochondrial genome and regulate the expression of nuclear transcripts. Initial evidence for their existence came from the Manila clam Ruditapes philippinarum, where they have been described and whose activity has been biologically validated through RNA injection experiments. Current evidence on the existence of these RNAs in other species is based only on small RNA sequencing. As a preliminary step to characterize smithRNAs across different metazoan lineages, a dedicated, unified, analytical workflow is needed.

Results: We propose a novel workflow specifically designed for smithRNAs. Sequence data (from small RNA sequencing) uniquely mapping to the mitochondrial genome are clustered into putative smithRNAs and prefiltered based on their abundance, presence in replicate libraries and 5' and 3' transcription boundary conservation. The surviving sequences are subsequently compared to the untranslated regions of nuclear transcripts based on seed pairing, overall match and thermodynamic stability to identify possible targets. Ample collateral information and graphics are produced to help characterize these molecules in the species of choice and guide the operator through the analysis. The workflow was tested on the original Manila clam data. Under basic settings, the results of the original study are largely replicated. The effect of additional parameter customization (clustering threshold, stringency, minimum number of replicates, seed matching) was further evaluated.
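The seed pairing step in the target search can be sketched as follows. This is a minimal illustration rather than SmithHunter's own code: the function names are hypothetical, the seed is assumed to span nucleotides 2 to 8 of the small RNA (a convention borrowed from microRNA targeting), only perfect Watson-Crick pairing is checked, and the sequences are arbitrary examples. The overall match and thermodynamic stability criteria mentioned above are not modelled.

```python
from typing import List

COMPLEMENT = str.maketrans("ACGU", "UGCA")


def to_rna(sequence: str) -> str:
    """Upper-case a sequence and convert DNA-style T to U."""
    return sequence.upper().replace("T", "U")


def reverse_complement(rna: str) -> str:
    """Reverse complement of an RNA string."""
    return rna.translate(COMPLEMENT)[::-1]


def seed_sites(small_rna: str, utr: str, seed_start: int = 2, seed_end: int = 8) -> List[int]:
    """0-based UTR positions where the seed pairs perfectly.

    The seed is nucleotides seed_start..seed_end of the small RNA (1-based,
    inclusive); positions 2-8 are an assumed default.
    """
    seed = to_rna(small_rna)[seed_start - 1:seed_end]
    site = reverse_complement(seed)  # what the UTR must contain, read 5' to 3'
    target = to_rna(utr)
    hits: List[int] = []
    position = target.find(site)
    while position != -1:
        hits.append(position)
        position = target.find(site, position + 1)
    return hits


# Arbitrary example sequences (hypothetical, not from the Manila clam data).
candidate = "UGAGGUAGUAGGUUGUAUAGUU"
utr_with_site = "AAACUACCUCAGGG"
utr_without_site = "AAAGGGCCCUUUAAA"

print(seed_sites(candidate, utr_with_site))
print(seed_sites(candidate, utr_without_site))
```

Exposing the seed boundaries as parameters mirrors the abstract's point that seed matching is one of the settings a user may want to customize.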

Conclusions: The study of smithRNAs is still in its infancy and no dedicated analytical workflow is currently available. At its core, the SmithHunter workflow builds on the bioinformatic procedure originally applied to identify candidate smithRNAs in the Manila clam. In fact, this is currently the only evidence for smithRNAs that has been biologically validated and, therefore, the natural starting point for characterizing smithRNAs in other species. The original analysis was re-implemented using current software and some minor issues were solved. Moreover, the workflow was improved by allowing the customization of different analytical parameters, mostly focusing on stringency and the possibility of accounting for a minimal level of genetic differentiation among samples.

Citations: 0
PSSM-Sumo: deep learning based intelligent model for prediction of sumoylation sites using discriminative features.
IF 2.9 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-08-30 DOI: 10.1186/s12859-024-05917-0
Salman Khan, Salman A AlQahtani, Sumaiya Noor, Nijad Ahmad

Post-translational modifications (PTMs) are fundamental to essential biological processes, exerting significant influence over gene expression, protein localization, stability, and genome replication. Sumoylation, a PTM involving the covalent attachment of a small ubiquitin-like modifier (SUMO) protein to a target protein, profoundly impacts the functional diversity of proteins. Identifying sumoylation sites has garnered significant attention because of their crucial roles in proteomic functions and their implications in various diseases, including Parkinson's and Alzheimer's. Although several computational models have been proposed for identifying sumoylation sites, their effectiveness is constrained by the limitations of conventional learning methodologies. In this study, we introduce PSSM-Sumo, a computational model designed to predict sumoylation sites accurately using pseudo-position-specific scoring matrix (PsePSSM) features, an optimized deep learning algorithm, and efficient feature extraction techniques. To streamline computation and eliminate irrelevant and noisy features, sequential forward selection using a support vector machine (SFS-SVM) is applied to identify optimal features. A multi-layer deep neural network (DNN) serves as the classifier. We assess the performance of PSSM-Sumo through tenfold cross-validation, using statistical metrics such as the Matthews correlation coefficient (MCC), accuracy, sensitivity, specificity, and the area under the ROC curve (AUC). Comparative analyses show that PSSM-Sumo achieves an average prediction accuracy of 98.71%, surpassing existing models. The robustness and accuracy of the proposed model position it as a promising tool for advancing drug discovery and the diagnosis of diverse diseases linked to sumoylation sites.
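
For readers unfamiliar with the threshold-based metrics named in the abstract, the sketch below computes accuracy, sensitivity, specificity, and MCC from the four cells of a binary confusion matrix using their standard definitions. The function name and the example counts are illustrative assumptions; they are not taken from the paper, and AUC is omitted because it requires ranked scores rather than counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fn count true sites predicted correctly/incorrectly;
    tn/fp count non-sites predicted correctly/incorrectly.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on positives
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on negatives
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Made-up counts for one cross-validation fold.
    print(binary_metrics(tp=90, fp=10, tn=80, fn=20))
```

In a tenfold cross-validation, these values would typically be computed per fold and then averaged. MCC is often reported alongside accuracy because it stays informative when positive and negative sites are imbalanced.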
Citations: 0
CNVDeep: deep association of copy number variants with neurocognitive disorders.
IF 2.9 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-08-29 DOI: 10.1186/s12859-024-05874-8
Zahra Rahaie, Hamid R Rabiee, Hamid Alinejad-Rokny

Background: Copy number variants (CNVs) have become increasingly instrumental in understanding the etiology of all diseases and phenotypes, including Neurocognitive Disorders (NDs). Among the well-established regions associated with NDs are deletions of small parts of chromosome 16 (16p11.2) and duplications on chromosome 15 (15q3). Various methods have been developed to identify associations between CNVs and diseases of interest. The majority of these methods are based on statistical inference techniques. However, because CNV features are multi-dimensional, these methods are still immature. In addition, the regions discovered by different methods are large, while the causative regions may be much smaller.

Results: In this study, we propose a regularized deep learning model to select causal regions for the target disease. With the help of the proximal gradient descent algorithm [20], the model uses the group LASSO concept and embeds a deep learning model in a sparsity framework. We perform the CNV analysis for 74,811 individuals with three types of brain disorders: autism spectrum disorder (ASD), schizophrenia (SCZ), and developmental delay (DD). We also perform a cumulative analysis to discover the regions that are common among the NDs. The brain expression of genes associated with diseases has increased by an average of 20 percent, and genes with homologs in mice that cause nervous system phenotypes have increased by 18 percent (on average). Alongside gene ontology analysis, the DECIPHER data source is also used to seek other phenotypes connected to the detected regions. The target diseases are correlated with some unexplored regions, such as deletions on 1q21.1 and 1q21.2 (for ASD), deletions on 20q12 (for SCZ), and duplications on 8p23.3 (for DD). Furthermore, our method is compared with other machine learning algorithms.

Conclusions: Our model effectively identifies regions associated with phenotypic traits using regularized deep learning. Rather than attempting to analyze the whole genome, CNVDeep allows us to focus only on the causative regions of disease.
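
The combination of group LASSO and proximal gradient descent named in the Results can be sketched as follows. This is a generic illustration of the technique, not the CNVDeep code: the function names, the flat weight list, and the grouping of weights by region are assumptions made for the example. The proximal operator of the group LASSO penalty is block soft-thresholding, which shrinks each group of weights toward zero and removes a group entirely when its Euclidean norm falls below the threshold.

```python
import math


def group_soft_threshold(group, threshold):
    """Proximal operator of threshold * ||group||_2 (block soft-thresholding)."""
    norm = math.sqrt(sum(w * w for w in group))
    if norm <= threshold:
        return [0.0] * len(group)  # the whole group is switched off
    scale = 1.0 - threshold / norm
    return [scale * w for w in group]


def proximal_gradient_step(weights, grads, groups, lr, lam):
    """One proximal gradient step for loss + lam * sum_g ||w_g||_2.

    weights, grads: flat lists of equal length.
    groups: list of index lists; here each group is assumed to hold the
            input weights belonging to one candidate region (hypothetical layout).
    """
    # Plain gradient step on the smooth part of the objective.
    stepped = [w - lr * g for w, g in zip(weights, grads)]
    updated = list(stepped)
    # Proximal step on the non-smooth group penalty.
    for indices in groups:
        shrunk = group_soft_threshold([stepped[i] for i in indices], lr * lam)
        for i, value in zip(indices, shrunk):
            updated[i] = value
    return updated


if __name__ == "__main__":
    w = [3.0, 4.0, 0.1, 0.1]
    g = [0.0, 0.0, 0.0, 0.0]
    print(proximal_gradient_step(w, g, groups=[[0, 1], [2, 3]], lr=1.0, lam=1.0))
```

Groups whose weights are driven exactly to zero correspond to inputs that the model ignores, which is how a group penalty can act as a region selector.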
Citations: 0
Prediction of mutation-induced protein stability changes based on the geometric representations learned by a self-supervised method.
IF 2.9 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-08-28 DOI: 10.1186/s12859-024-05876-6
Shan Shan Li, Zhao Ming Liu, Jiao Li, Yi Bo Ma, Ze Yuan Dong, Jun Wei Hou, Fu Jie Shen, Wei Bu Wang, Qi Ming Li, Ji Guo Su

Background: Thermostability is a fundamental property that proteins need to maintain their biological functions. Predicting protein stability changes upon mutation is important for understanding protein structure-function relationships and is also of great interest in protein engineering and pharmaceutical design.

Results: Here we present mutDDG-SSM, a deep learning-based framework that uses the geometric representations encoded in protein structure to predict mutation-induced protein stability changes. mutDDG-SSM consists of two parts: a graph attention network-based protein structural feature extractor, trained with a self-supervised learning scheme on large-scale high-resolution protein structures, and an eXtreme Gradient Boosting model-based stability change predictor, which helps alleviate overfitting. The performance of mutDDG-SSM was tested on several widely used independent datasets. Myoglobin and p53 were then used as case studies to illustrate the effectiveness of the model in predicting protein stability changes upon mutation. Our results show that mutDDG-SSM achieved high performance in estimating the effects of mutations on protein stability. In addition, mutDDG-SSM exhibited good unbiasedness: the prediction accuracy on inverse mutations is as good as that on direct mutations.

Conclusion: Meaningful features can be extracted from our pre-trained model to build downstream tasks and our model may serve as a valuable tool for protein engineering and drug design.
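
The unbiasedness claim in the Results rests on a thermodynamic symmetry: the stability change of an inverse mutation should be the negative of that of the direct mutation. A common way to check this, sketched below, is to compare paired predictions through their Pearson correlation (ideally -1) and their average offset (ideally 0). The abstract does not state which measures the authors used, so the function names and the choice of these two statistics are assumptions, and the numbers are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-zero variance assumed)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def antisymmetry_report(direct, inverse):
    """Compare predicted stability changes for direct and inverse mutations.

    direct[i] and inverse[i] are predictions for a mutation and its reverse.
    Returns (correlation, bias); a perfectly antisymmetric predictor gives
    correlation -1 and bias 0.
    """
    correlation = pearson(direct, inverse)
    bias = sum(d + i for d, i in zip(direct, inverse)) / (2 * len(direct))
    return correlation, bias


if __name__ == "__main__":
    direct = [1.2, -0.4, 2.1, 0.3]      # made-up predictions
    inverse = [-1.0, 0.5, -2.3, -0.2]   # made-up predictions for the reverse mutations
    print(antisymmetry_report(direct, inverse))
```

A predictor trained mostly on destabilizing mutations tends to show a non-zero bias here, which is why this check is usually reported separately from accuracy on direct mutations.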
Citations: 0