Mutation Pathogenicity Prediction by a Biology Based Explainable AI Multi-Modal Algorithm

medRxiv Pub Date : 2024-06-05 DOI:10.1101/2024.06.05.24308476

R. Kellerman, O. Nayshool, O. Barel, S. Paz, N. Amariglio, E. Klang, G. Rechavi


Citations: 0

Abstract

Most known pathogenic mutations occur in protein-coding regions of DNA and change the way proteins are made. Deciphering protein structure therefore provides great insight into the molecular mechanisms underlying biological functions in human disease. Although there have recently been major advances in artificial intelligence-based prediction of protein structure, determination of the biological and clinical relevance of specific mutations is not yet up to clinical standards. This challenge is of utmost medical importance when decisions as critical as suggesting termination of pregnancy or recommending cancer-directed rational drugs depend on accurately predicting the effect of a specific mutation. Currently available tools aim to characterize the effect of a mutation on the functionality of the protein according to biochemical criteria, independent of the biological context. A specific change in protein structure can result in either loss of function (LOF) or gain of function (GOF), and the directionality of the effect needs to be taken into account when interpreting the biological outcome of the mutation. Here we describe Triple-modalities Variant Interpretation and Analysis (TriVIAI), a tool incorporating three complementary modalities for improved prediction of missense mutation pathogenicity: a protein language model (pLM), a graph neural network (GNN), and a tabular model incorporating physical properties from the protein structure. The TriVIAl ensemble's predictions compare favorably with existing tools across various metrics, achieving an AUC-ROC of 0.887, a precision-recall curve (PRC) score of 0.68, and a Brier score of 0.16. The TriVIAI ensemble also has two major advantages over other available tools.
The first is the incorporation of biological insights that make it possible to differentiate between GOF mutations, which tend to cluster in specific hotspots and affect structure in a specific functional way, and LOF mutations, which are usually dispersed and can cripple the protein in a variety of ways. Importantly, the advantage over other available tools is more noticeable with GOF mutations, as their effect on the protein structure is less disruptive and can be misinterpreted by current variant prioritization strategies. The second significant advantage of TriVIAI is the explainability of the ensemble, in contrast to other available AI-based pathogenicity prediction algorithms, which until now have been a black box for their users. This explainability is of major importance considering the clinical responsibility of the medical decision-makers who use AI-based pathogenicity predictors.
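
For readers unfamiliar with the reported metrics, the following is a minimal sketch of how an AUC-ROC and a Brier score can be computed for an ensemble of three per-variant probability scores. It is illustrative only: the abstract does not state how the three modalities are combined, so the unweighted averaging in `average_ensemble` is an assumption, and all function names, variable names, labels, and scores here are hypothetical toy values rather than the authors' code or data.

```python
from itertools import product


def brier_score(y_true, y_prob):
    """Mean squared gap between predicted probability and the 0/1 label (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def auc_roc(y_true, y_prob):
    """Chance that a random positive outscores a random negative (ties count as 0.5)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a, b in product(pos, neg))
    return wins / (len(pos) * len(neg))


def average_ensemble(*model_probs):
    """ASSUMPTION: unweighted mean of per-model probabilities.

    The abstract does not describe the actual combination rule.
    """
    return [sum(ps) / len(ps) for ps in zip(*model_probs)]


# Toy data for illustration only: 1 = pathogenic, 0 = benign.
labels = [1, 0, 1, 0]
plm_scores = [0.9, 0.2, 0.6, 0.4]      # hypothetical protein language model output
gnn_scores = [0.7, 0.4, 0.8, 0.2]      # hypothetical graph neural network output
tabular_scores = [0.8, 0.3, 0.7, 0.3]  # hypothetical tabular model output

ensemble = average_ensemble(plm_scores, gnn_scores, tabular_scores)
print("AUC-ROC:", auc_roc(labels, ensemble))
print("Brier score:", brier_score(labels, ensemble))
```

The pairwise AUC-ROC shown here is quadratic in the number of variants, which is fine for a toy example; a rank-based implementation from a standard library would be the usual choice on real data.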


