Artificial intelligence in the life sciences: latest articles, page 4

The path to adoption of open source AI for drug discovery in Africa

Artificial intelligence in the life sciences

Pub Date : 2024-12-05 DOI: 10.1016/j.ailsci.2024.100118

Gemma Turon, Miquel Duran-Frigola

Citations: 0

Corrigendum to “Modeling PROTAC degradation activity with machine learning” [Artif. Intell. Life Sci. 6 (2024) 100104]

Artificial intelligence in the life sciences

Pub Date : 2024-12-01 DOI: 10.1016/j.ailsci.2024.100114

Stefano Ribes , Eva Nittinger , Christian Tyrchan , Rocío Mercado

PROTACs are a promising therapeutic modality that harnesses the cell’s built-in degradation machinery to degrade specific proteins. Despite their potential, developing new PROTACs is challenging and requires significant domain expertise, time, and cost. Meanwhile, machine learning has transformed drug design and development. In this work, we present a strategy for curating open-source PROTAC data and an open-source deep learning tool for predicting the degradation activity of novel PROTAC molecules. The curated dataset incorporates important information such as $pDC_{50}$, $D_{max}$, E3 ligase type, POI amino acid sequence, and experimental cell type. Our model architecture leverages learned embeddings from pretrained machine learning models, in particular for encoding protein sequences and cell type information. We assessed the quality of the curated data and the generalization ability of our model architecture against new PROTACs and targets via three tailored studies, which we recommend other researchers to use in evaluating their degradation activity models. In each study, three models predict protein degradation in a majority vote setting, reaching a top test accuracy of 80.8% and 0.865 ROC-AUC, and a test accuracy of 62.3% and 0.604 ROC-AUC when generalizing to novel protein targets. Our results are not only comparable to state-of-the-art models for protein degradation prediction, but also part of an open-source implementation which is easily reproducible and less computationally complex than existing approaches.



Citations: 0

Rethinking the 'best method' paradigm: The effectiveness of hybrid and multidisciplinary approaches in chemoinformatics

Artificial intelligence in the life sciences

Pub Date : 2024-12-01 DOI: 10.1016/j.ailsci.2024.100117

José L. Medina-Franco , Johny R. Rodríguez-Pérez , Héctor F. Cortés-Hernández , Edgar López-López

In Chemoinformatics, as in many other computational-related disciplines, it is a common practice to identify the “single best” approach or methodology, for instance, identify the best fingerprint representation, the best single virtual screening approach or protocol, the optimal representation of the chemical space, the best predictive model, to name a few. In molecular modeling, a typical example is finding the best docking program. However, it is also known that each approach has its advantages and limitations. There are examples of benchmark studies comparing different approaches to find the most appropriate solution, and it is common to find that there are no single best programs in such studies. Yet, searching for the “best” methods is still common. The main goal of this work is to survey hybrid methodologies recently developed in Chemoinformatics. The list of approaches is not exhaustive, but it aims to cover several representative applications. One of the major outcomes of the survey is that, for various purposes, individual methods do not perform as well as the combination of approaches because single methods have inherent limitations with advantages and disadvantages.



Citations: 0

Corrigendum to “Modeling PROTAC degradation activity with machine learning” [Artificial Intelligence in the Life Sciences 6 (2024) 100104]

Artificial intelligence in the life sciences

Pub Date : 2024-12-01 DOI: 10.1016/j.ailsci.2024.100105

Stefano Ribes , Eva Nittinger , Christian Tyrchan , Rocío Mercado

Citations: 0

Pharmacological profiles of neglected tropical disease drugs

Artificial intelligence in the life sciences

Pub Date : 2024-10-30 DOI: 10.1016/j.ailsci.2024.100116

Alessandro Greco , Reagon Karki , Yojana Gadiya , Clara Deecke , Andrea Zaliani , Sheraz Gul

According to the World Health Organization, there is a group of 20 diverse infectious Neglected Tropical Disease (NTD) conditions that primarily affect populations in low-income and developing regions. Despite the limited attention and funding compared to other health concerns, significant efforts to develop drugs for treating and controlling NTDs have been made. However, there is room for developing NTD drugs with improved safety, efficacy and ecotoxicological profiles. To facilitate this, we have adapted our existing validated data-driven workflows for understanding disease comorbidity to systematically evaluate the approved drugs that target the major World Health Organization-defined NTDs. The foundation for this work comprised assembling the physicochemical, biological and clinical properties of each NTD drug and identifying patterns that reveal the underlying cause of their efficacy and side-effect profiles. Subsequently, computational methods were employed to identify analogs with potentially improved profiles, and these were validated in a case study focusing on the teratogenic antileishmanial drug miltefosine. The wider impact of NTD drugs with regard to a One Health cross-disciplinary perspective at the human-animal-environment interface is also discussed.



Citations: 0

DTA Atlas: A massive-scale drug repurposing database

Artificial intelligence in the life sciences

Pub Date : 2024-10-18 DOI: 10.1016/j.ailsci.2024.100115

Madina Sultanova , Elizaveta Vinogradova , Alisher Amantay , Ferdinand Molnár , Siamac Fazli

The drug development process is costly and time-consuming. Repurposing existing approved drugs, an efficient and cost-effective strategy, involves assessing numerous drug-protein pairs to uncover new interactions. While modern in silico methods enhance scalability, an open database for projected drug-target interactions across the entire human proteome is still lacking. In this work, we introduce an open database of predicted drug-target interactions, termed DTA Atlas, covering the entire human proteome as well as a wide range of marketed drugs, resulting in over 220 million drug-target pairs. The database integrates 4 billion affinity predictions from advanced deep neural networks and offers a user-friendly web interface, enabling users to explore drug-target affinity predictions for the human proteome. To the best of our knowledge, DTA Atlas represents the first comprehensive collection of drug-target binding strength predictions. It is open-source and can serve as an important resource for drug development, drug repurposing, toxicity studies and more.



Citations: 0

Modeling PROTAC degradation activity with machine learning

Artificial intelligence in the life sciences

Pub Date : 2024-07-14 DOI: 10.1016/j.ailsci.2024.100104

Stefano Ribes , Eva Nittinger , Christian Tyrchan , Rocío Mercado

PROTACs are a promising therapeutic modality that harnesses the cell’s built-in degradation machinery to degrade specific proteins. Despite their potential, developing new PROTACs is challenging and requires significant domain expertise, time, and cost. Meanwhile, machine learning has transformed drug design and development. In this work, we present a strategy for curating open-source PROTAC data and an open-source deep learning tool for predicting the degradation activity of novel PROTAC molecules. The curated dataset incorporates important information such as $pDC_{50}$, $D_{max}$, E3 ligase type, POI amino acid sequence, and experimental cell type. Our model architecture leverages learned embeddings from pretrained machine learning models, in particular for encoding protein sequences and cell type information. We assessed the quality of the curated data and the generalization ability of our model architecture against new PROTACs and targets via three tailored studies, which we recommend other researchers to use in evaluating their degradation activity models. In each study, three models predict protein degradation in a majority vote setting, reaching a top test accuracy of 82.6% and 0.848 ROC AUC, and a test accuracy of 61% and 0.615 ROC AUC when generalizing to novel protein targets. Our results are not only comparable to state-of-the-art models for protein degradation prediction, but also part of an open-source implementation which is easily reproducible and less computationally complex than existing approaches.
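The "majority vote setting" mentioned in the abstract can be illustrated with a minimal sketch. It assumes three binary classifiers whose hard 0/1 predictions are combined per molecule; the function names and the toy labels are hypothetical and are not taken from the authors' open-source tool, which may combine its models differently.

```python
def majority_vote(predictions_per_model):
    """Combine binary labels from several models by majority vote.

    predictions_per_model: list of equal-length lists of 0/1 labels,
    one inner list per model. An odd number of models avoids ties.
    """
    n_models = len(predictions_per_model)
    if n_models % 2 == 0:
        raise ValueError("use an odd number of models to avoid ties")
    voted = []
    # zip(*...) yields, for each sample, the tuple of votes from all models.
    for votes in zip(*predictions_per_model):
        voted.append(1 if sum(votes) > n_models / 2 else 0)
    return voted


def accuracy(y_true, y_pred):
    """Fraction of samples where the predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical toy labels: 1 = active degrader, 0 = inactive.
model_a = [1, 0, 1, 1, 0]
model_b = [1, 1, 0, 1, 0]
model_c = [0, 0, 1, 1, 1]
y_true = [1, 0, 1, 1, 0]

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
print(accuracy(y_true, ensemble))
```

In this toy case each individual model makes one or two mistakes, but no two models err on the same sample, so the vote recovers every true label. Real ensembles only benefit to the extent that their errors are similarly uncorrelated.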



Citations: 0

Machine learning proteochemometric models for Cereblon glue activity predictions

Artificial intelligence in the life sciences

Pub Date : 2024-06-11 DOI: 10.1016/j.ailsci.2024.100100

Francis J. Prael III , Jiayi Cox , Noé Sturm , Peter Kutchukian , William C. Forrester , Gregory Michaud , Jutta Blank , Lingling Shen , Raquel Rodríguez-Pérez

Targeted protein degradation (TPD) is a rapidly developing drug discovery technique with unique efficacy and target scope stemming from its degradation-based activity. Molecular glue degraders are a promising arm of TPD, as evidenced by the FDA-approved therapeutics within this class, the increasing number of degraders in clinical development, and their predisposition to drug-likeness. Cereblon (CRBN) glue degraders mediate target degradation by generating a neomorphic interface between CRBN and a protein of interest. While promising, the complicated nature of this CRBN-glue-target ternary complex makes the rational design of molecular glue degraders challenging. For other drug modalities, predictive modeling has been established to leverage existing activity data and generate quantitative structure-activity relationships (QSAR). However, the applicability of QSAR strategies for glues remains under-investigated. Herein, machine learning methodologies were developed to predict glue-mediated recruitment of CRBN to target proteins and achieved promising performance. Generated models leveraged more than a hundred internal screening campaigns across thousands of CRBN glues to predict glue-mediated recruitment of targets to CRBN. Our results show that recruitment activity of CRBN glue degraders can be modeled by machine learning, with 89% of models producing an area under the receiver operating characteristic curve (ROC AUC) > 0.8 and 70% of models producing a Matthews correlation coefficient (MCC) > 0.2 for these primary screening data. Importantly, our findings also indicate that the combination of compound and protein descriptors in the so-called proteochemometric models improves performance, with >80% of the models exhibiting higher ROC AUC and MCC values than per-target models only based on compound information. Hence, our investigations suggest that proteochemometric modeling is a successful approach for molecular glue degraders. The proposed machine learning strategies can aid compound prioritization based on recruitment efficacy and target selectivity, and thus have the potential to facilitate the design and discovery of therapeutic CRBN molecular glues.
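Because the abstract reports the Matthews correlation coefficient (MCC) on primary screening data, where actives are typically rare, a short worked example may help show why MCC is informative there. The sketch below uses the standard confusion-matrix definition of MCC; the labels are hypothetical toy data, not results from the paper.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def matthews_corrcoef(y_true, y_pred):
    """MCC = (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn))."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Conventional value when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical imbalanced screen: 2 actives among 10 compounds.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A trivial model that calls everything inactive is 80% accurate...
all_inactive = [0] * 10
print(matthews_corrcoef(y_true, all_inactive))

# ...while a model that finds one active at the cost of one false positive
# has the same accuracy but a clearly positive MCC.
one_hit = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(matthews_corrcoef(y_true, one_hit))
```

The two toy models have identical accuracy (0.8), yet the first scores an MCC of 0 and the second 0.375, which is why thresholds such as MCC > 0.2 carry more meaning than accuracy on imbalanced screening sets.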



Citations: 0

Statistical approaches enabling technology-specific assay interference prediction from large screening data sets

Artificial intelligence in the life sciences

Pub Date : 2024-06-01 DOI: 10.1016/j.ailsci.2024.100099

Vincenzo Palmacci , Steffen Hirte , Jorge Enrique Hernández González , Floriane Montanari , Johannes Kirchmair

High throughput screening (HTS) technologies allow the biological testing of hundreds of thousands of compounds per day. Typically, a substantial proportion of the initial hits obtained by HTS are artifacts caused by assay interference. Therefore, global and technology-specific in silico models for identifying and predicting compounds interfering with biological assays have been developed. The global models benefit from training on large screening data sets, while the specialized models benefit from training on assay technology-specific experimental data. In this work, we develop and explore strategies for generating better predictors of technology-specific assay interference by utilizing the large bioactivity data matrices global models are trained on and employing partially new compound labeling approaches to maintain the assay technology awareness of specialized models. We demonstrate the utility of the statistically derived interference labels in machine learning using fluorescence-based assay interference as a representative example. Our random forest and multi-layer perceptron classifiers showed improved performance compared to existing models, achieving Matthews correlation coefficients (MCCs) of up to 0.47 on holdout data and up to 0.45 on an external test set. These results demonstrate that accurate assay-specific interference labels can be derived from large bioactivity data matrices, enabling the development of new machine-learning models without the need for further experimental data.



Citations: 0

Federated learning for predicting compound mechanism of action based on image-data from cell painting

Artificial intelligence in the life sciences

Pub Date : 2024-05-09 DOI: 10.1016/j.ailsci.2024.100098

Li Ju , Andreas Hellander , Ola Spjuth

Having access to sufficient data is essential in order to train accurate machine learning models, but much data is not publicly available. In drug discovery this is particularly evident, as much data is withheld at pharmaceutical companies for various reasons. Federated Learning (FL) aims at training a joint model between multiple parties but without disclosing data between the parties. In this work, we leverage Federated Learning to predict compound Mechanism of Action (MoA) using fluorescence image data from cell painting. Our study evaluates the effectiveness and efficiency of FL, comparing to non-collaborative and data-sharing collaborative learning in diverse scenarios. Specifically, we investigate the impact of data heterogeneity across participants on MoA prediction, an essential concern in real-life applications of FL, and demonstrate the benefits for all involved parties. This work highlights the potential of federated learning in multi-institutional collaborative machine learning for drug discovery and assessment of chemicals, offering a promising avenue to overcome data-sharing constraints.
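The core idea of training a joint model without exchanging data can be sketched with federated averaging (FedAvg). This is an illustrative assumption: the abstract does not state which aggregation scheme the authors used, and the one-parameter linear model, function names, and toy data below are hypothetical stand-ins for an image-based MoA classifier.

```python
def local_update(w, data, lr=0.01, epochs=5):
    """Run a few gradient-descent steps on one client's private data.

    The model is y = w * x with squared-error loss; only the updated
    weight leaves the client, never the (x, y) pairs themselves.
    """
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(client_weights, client_sizes):
    """Server step: average client weights, weighted by sample count."""
    total = sum(client_sizes)
    return sum(w * n for w, n in zip(client_weights, client_sizes)) / total


def train_federated(clients, rounds=50, w0=0.0):
    """Alternate local training and server-side averaging."""
    w = w0
    for _ in range(rounds):
        updates = [local_update(w, data) for data in clients]
        w = federated_average(updates, [len(data) for data in clients])
    return w


# Two hypothetical participants with different data ranges (a simple form
# of heterogeneity); both follow the same underlying relation y = 2x.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0), (5.0, 10.0)],
]
w_global = train_federated(clients)
print(f"global weight = {w_global:.4f}")
```

Weighting by sample count keeps larger participants from being under-represented in the joint model; how well such averaging copes with strongly heterogeneous data is the kind of question the study examines.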


Volume 5, Article 100098. Published 2024-05-09. PDF: https://www.sciencedirect.com/science/article/pii/S2667318524000059/pdfft?md5=100e1ed9ac27f95816db906647d11bc0&pid=1-s2.0-S2667318524000059-main.pdf

Citations: 0