通过同源性预测亲和力(PATH):可解释的结合亲和力预测与持续同源。

bioRxiv : the preprint server for biology Pub Date : 2024-10-21 DOI:10.1101/2023.11.16.567384

Yuxi Long, Bruce R Donald

{"title":"通过同源性预测亲和力(PATH):可解释的结合亲和力预测与持续同源。","authors":"Yuxi Long, Bruce R Donald","doi":"10.1101/2023.11.16.567384","DOIUrl":null,"url":null,"abstract":"Accurate binding affinity prediction is crucial to structure-based drug design. Recent work used computational topology to obtain an effective representation of protein-ligand interactions. While algorithms using algebraic topology have proven useful in predicting properties of biomolecules, previous algorithms employed uninterpretable machine learning models which failed to explain the underlying geometric and topological features that drive accurate binding affinity prediction. Moreover, they had high computational complexity which made them intractable for large proteins. We present the fastest known algorithm to compute persistent homology features for protein-ligand complexes using opposition distance, with a runtime that is independent of the protein size. Then, we exploit these features in a novel, interpretable algorithm to predict protein-ligand binding affinity. Our algorithm achieves interpretability through an effective embedding of distances across bipartite matchings of the protein and ligand atoms into real-valued functions by summing Gaussians centered at features constructed by persistent homology. We name these functions internuclear persistent contours (IPCs) . Next, we introduce persistence fingerprints , a vector with 10 components that sketches the distances of different bipartite matching between protein and ligand atoms, refined from IPCs. Let the number of protein atoms in the protein-ligand complex be n , number of ligand atoms be m , and ω ≈ 2.4 be the matrix multiplication exponent. We show that for any 0 < ε < 1, after an 𝒪 ( mn log( mn )) preprocessing procedure, we can compute an ε -accurate approximation to the persistence fingerprint in 𝒪 ( m log 6 ω ( m/ε )) time, independent of protein size. This is an improvement in time complexity by a factor of 𝒪 (( m + n ) 3 ) over any previous binding affinity prediction that uses persistent homology. We show that the representational power of persistence fingerprint generalizes to protein-ligand binding datasets beyond the training dataset. Then, we introduce PATH , Predicting Affinity Through Homology, a two-part algorithm consisting of PATH + and PATH - . PATH + is an interpretable, small ensemble of shallow regression trees for binding affinity prediction from persistence fingerprints. We show that despite using 1,400-fold fewer features, PATH + has comparable performance to a previous state-of-the-art binding affinity prediction algorithm that uses persistent homology. Moreover, PATH + has the advantage of being interpretable. We visualize the features captured by persistence fingerprint for variant HIV-1 protease complexes and show that persistence fingerprint captures binding-relevant structural mutations. PATH - , in turn, uses regression trees over IPCs to differentiate between binding and decoy complexes. Finally, we benchmarked PATH versus established binding affinity prediction algorithms spanning physics-based, knowledge-based, and deep learning methods, revealing that PATH has comparable or better performance with less overfitting, compared to these state-of-the-art methods. The source code for PATH is released open-source as part of the osprey protein design software package.","PeriodicalId":72407,"journal":{"name":"bioRxiv : the preprint server for biology","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10680814/pdf/","citationCount":"0","resultStr":"{\"title\":\"Predicting Affinity Through Homology (PATH): Interpretable Binding Affinity Prediction with Persistent Homology.\",\"authors\":\"Yuxi Long, Bruce R Donald\",\"doi\":\"10.1101/2023.11.16.567384\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Accurate binding affinity prediction is crucial to structure-based drug design. Recent work used computational topology to obtain an effective representation of protein-ligand interactions. While algorithms using algebraic topology have proven useful in predicting properties of biomolecules, previous algorithms employed uninterpretable machine learning models which failed to explain the underlying geometric and topological features that drive accurate binding affinity prediction. Moreover, they had high computational complexity which made them intractable for large proteins. We present the fastest known algorithm to compute persistent homology features for protein-ligand complexes using opposition distance, with a runtime that is independent of the protein size. Then, we exploit these features in a novel, interpretable algorithm to predict protein-ligand binding affinity. Our algorithm achieves interpretability through an effective embedding of distances across bipartite matchings of the protein and ligand atoms into real-valued functions by summing Gaussians centered at features constructed by persistent homology. We name these functions internuclear persistent contours (IPCs) . Next, we introduce persistence fingerprints , a vector with 10 components that sketches the distances of different bipartite matching between protein and ligand atoms, refined from IPCs. Let the number of protein atoms in the protein-ligand complex be n , number of ligand atoms be m , and ω ≈ 2.4 be the matrix multiplication exponent. We show that for any 0 < ε < 1, after an 𝒪 ( mn log( mn )) preprocessing procedure, we can compute an ε -accurate approximation to the persistence fingerprint in 𝒪 ( m log 6 ω ( m/ε )) time, independent of protein size. This is an improvement in time complexity by a factor of 𝒪 (( m + n ) 3 ) over any previous binding affinity prediction that uses persistent homology. We show that the representational power of persistence fingerprint generalizes to protein-ligand binding datasets beyond the training dataset. Then, we introduce PATH , Predicting Affinity Through Homology, a two-part algorithm consisting of PATH + and PATH - . PATH + is an interpretable, small ensemble of shallow regression trees for binding affinity prediction from persistence fingerprints. We show that despite using 1,400-fold fewer features, PATH + has comparable performance to a previous state-of-the-art binding affinity prediction algorithm that uses persistent homology. Moreover, PATH + has the advantage of being interpretable. We visualize the features captured by persistence fingerprint for variant HIV-1 protease complexes and show that persistence fingerprint captures binding-relevant structural mutations. PATH - , in turn, uses regression trees over IPCs to differentiate between binding and decoy complexes. Finally, we benchmarked PATH versus established binding affinity prediction algorithms spanning physics-based, knowledge-based, and deep learning methods, revealing that PATH has comparable or better performance with less overfitting, compared to these state-of-the-art methods. The source code for PATH is released open-source as part of the osprey protein design software package.\",\"PeriodicalId\":72407,\"journal\":{\"name\":\"bioRxiv : the preprint server for biology\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-10-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10680814/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"bioRxiv : the preprint server for biology\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1101/2023.11.16.567384\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"bioRxiv : the preprint server for biology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1101/2023.11.16.567384","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

准确的结合亲和力预测对基于结构的药物设计至关重要。最近的工作使用计算拓扑来获得蛋白质-配体相互作用的有效表示。尽管持续同源编码几何特征，但先前使用持续同源进行结合亲和预测的工作采用了不可解释的机器学习模型，未能解释驱动准确结合亲和预测的潜在几何和拓扑特征。在这项工作中，我们提出了一种新的、可解释的蛋白质配体结合亲和力预测算法。我们的算法通过有效地将蛋白质和配体原子的二部匹配之间的距离嵌入到实值函数中，从而实现可解释性，方法是将以持久同源性构建的特征为中心的高斯求和。我们将这些功能命名为核间持久轮廓(IPCs)。接下来，我们引入持久性指纹，这是一个由10个组件组成的向量，它描绘了蛋白质和配体原子之间不同的二部匹配的距离，从ipc中提炼出来。设蛋白质-配体复合物中蛋白质原子数为n，配体原子数为m， ω≈2.4为矩阵乘法指数。我们发现，对于任意0 < ε < 1，经过一个(mn log(mn))的预处理过程后，我们可以计算出一个ε精度的近似的持续指纹在(m log 6 ω (m/“))的时间内，与蛋白质的大小无关。这是在时间复杂度上的一个改进，比以前任何使用持久同源性的结合亲和预测都要提高一个因子((m + n) 3)。我们证明了持久性指纹的表征能力可以推广到训练数据集以外的蛋白质配体结合数据集。然后，我们引入了PATH，通过同源性预测亲和力，这是一个可解释的小浅层回归树集合，用于从持久性指纹预测绑定亲和力。我们表明，尽管使用的特征少了1400倍，但PATH的性能与先前使用持久同源特征的最先进的绑定亲和预测算法相当。此外，PATH具有可解释的优点。最后，我们可视化持久性指纹捕获的变异HIV-1蛋白酶复合物的特征，并表明持久性指纹捕获结合相关的结构突变。PATH的源代码作为鱼鹰蛋白设计软件包的一部分开源发布。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Predicting Affinity Through Homology (PATH): Interpretable Binding Affinity Prediction with Persistent Homology.

Accurate binding affinity prediction is crucial to structure-based drug design. Recent work used computational topology to obtain an effective representation of protein-ligand interactions. While algorithms using algebraic topology have proven useful in predicting properties of biomolecules, previous algorithms employed uninterpretable machine learning models which failed to explain the underlying geometric and topological features that drive accurate binding affinity prediction. Moreover, they had high computational complexity which made them intractable for large proteins. We present the fastest known algorithm to compute persistent homology features for protein-ligand complexes using opposition distance, with a runtime that is independent of the protein size. Then, we exploit these features in a novel, interpretable algorithm to predict protein-ligand binding affinity. Our algorithm achieves interpretability through an effective embedding of distances across bipartite matchings of the protein and ligand atoms into real-valued functions by summing Gaussians centered at features constructed by persistent homology. We name these functions internuclear persistent contours (IPCs) . Next, we introduce persistence fingerprints , a vector with 10 components that sketches the distances of different bipartite matching between protein and ligand atoms, refined from IPCs. Let the number of protein atoms in the protein-ligand complex be n , number of ligand atoms be m , and ω ≈ 2.4 be the matrix multiplication exponent. We show that for any 0 < ε < 1, after an 𝒪 ( mn log( mn )) preprocessing procedure, we can compute an ε -accurate approximation to the persistence fingerprint in 𝒪 ( m log ^{6 ω} ( m/ε )) time, independent of protein size. This is an improvement in time complexity by a factor of 𝒪 (( m + n ) ³ ) over any previous binding affinity prediction that uses persistent homology. We show that the representational power of persistence fingerprint generalizes to protein-ligand binding datasets beyond the training dataset. Then, we introduce PATH , Predicting Affinity Through Homology, a two-part algorithm consisting of PATH ⁺ and PATH ^- . PATH ⁺ is an interpretable, small ensemble of shallow regression trees for binding affinity prediction from persistence fingerprints. We show that despite using 1,400-fold fewer features, PATH ⁺ has comparable performance to a previous state-of-the-art binding affinity prediction algorithm that uses persistent homology. Moreover, PATH ⁺ has the advantage of being interpretable. We visualize the features captured by persistence fingerprint for variant HIV-1 protease complexes and show that persistence fingerprint captures binding-relevant structural mutations. PATH ^- , in turn, uses regression trees over IPCs to differentiate between binding and decoy complexes. Finally, we benchmarked PATH versus established binding affinity prediction algorithms spanning physics-based, knowledge-based, and deep learning methods, revealing that PATH has comparable or better performance with less overfitting, compared to these state-of-the-art methods. The source code for PATH is released open-source as part of the osprey protein design software package.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

bioRxiv : the preprint server for biology

自引率

0.00%

发文量