
Artificial intelligence in the life sciences: latest articles

Modeling PROTAC degradation activity with machine learning
Pub Date : 2024-07-14 DOI: 10.1016/j.ailsci.2024.100104

PROTACs are a promising therapeutic modality that harnesses the cell’s built-in degradation machinery to degrade specific proteins. Despite their potential, developing new PROTACs is challenging and requires significant domain expertise, time, and cost. Meanwhile, machine learning has transformed drug design and development. In this work, we present a strategy for curating open-source PROTAC data and an open-source deep learning tool for predicting the degradation activity of novel PROTAC molecules. The curated dataset incorporates important information such as pDC50, Dmax, E3 ligase type, POI amino acid sequence, and experimental cell type. Our model architecture leverages learned embeddings from pretrained machine learning models, in particular for encoding protein sequences and cell type information. We assessed the quality of the curated data and the generalization ability of our model architecture against new PROTACs and targets via three tailored studies, which we recommend other researchers use when evaluating their own degradation activity models. In each study, three models predict protein degradation in a majority vote setting, reaching a top test accuracy of 82.6% and 0.848 ROC AUC, and a test accuracy of 61% and 0.615 ROC AUC when generalizing to novel protein targets. Our results are not only comparable to state-of-the-art models for protein degradation prediction, but also part of an open-source implementation which is easily reproducible and less computationally complex than existing approaches.
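
The abstract mentions pDC50 values and a three-model majority vote. The following is a minimal Python sketch of both ideas, not the authors' implementation. It assumes the usual convention that pDC50 is the negative base-10 logarithm of DC50 in mol/L, and it assumes hard (label-level) voting; the function names `pdc50` and `majority_vote` are hypothetical.

```python
import math


def pdc50(dc50_molar: float) -> float:
    """Convert a DC50 given in mol/L to pDC50 = -log10(DC50)."""
    if dc50_molar <= 0:
        raise ValueError("DC50 must be positive")
    return -math.log10(dc50_molar)


def majority_vote(predictions: list[list[int]]) -> list[int]:
    """Combine binary labels from several models by hard majority vote.

    predictions[m][i] is the 0/1 label that model m assigns to sample i.
    A sample is labeled 1 when more than half of the models vote 1.
    """
    n_models = len(predictions)
    votes_per_sample = [sum(column) for column in zip(*predictions)]
    return [1 if 2 * votes > n_models else 0 for votes in votes_per_sample]


if __name__ == "__main__":
    # Three hypothetical models voting on three hypothetical PROTACs.
    labels = majority_vote([[1, 0, 1], [1, 1, 0], [0, 0, 1]])
    print(labels)
    print(pdc50(1e-7))
```

With an odd number of models, as in the three-model setting described above, a hard vote can never tie, which is one practical reason to prefer an odd ensemble size.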

Citations: 0
Machine learning proteochemometric models for Cereblon glue activity predictions
Pub Date : 2024-06-11 DOI: 10.1016/j.ailsci.2024.100100
Francis J. Prael III, Jiayi Cox, Noé Sturm, Peter Kutchukian, William C. Forrester, Gregory Michaud, Jutta Blank, Lingling Shen, Raquel Rodríguez-Pérez

Targeted protein degradation (TPD) is a rapidly developing drug discovery technique with unique efficacy and target scope stemming from its degradation-based activity. Molecular glue degraders are a promising arm of TPD, as evidenced by the FDA-approved therapeutics within this class, the increasing number of degraders in clinical development, and their predisposition to drug-likeness. Cereblon (CRBN) glue degraders mediate target degradation by generating a neomorphic interface between CRBN and a protein of interest. While promising, the complicated nature of this CRBN-glue-target ternary complex makes the rational design of molecular glue degraders challenging. For other drug modalities, predictive modeling has been established to leverage existing activity data and generate quantitative structure-activity relationships (QSAR). However, the applicability of QSAR strategies for glues remains under-investigated. Herein, machine learning methodologies were developed to predict glue-mediated recruitment of CRBN to target proteins and achieved promising performance. Generated models leveraged more than a hundred internal screening campaigns across thousands of CRBN glues to predict glue-mediated recruitment of targets to CRBN. Our results show that recruitment activity of CRBN glue degraders can be modeled by machine learning, with 89% of models producing an area under the receiver operating characteristic curve (ROC AUC) > 0.8 and 70% of models producing a Matthews correlation coefficient (MCC) > 0.2 for these primary screening data. Importantly, our findings also indicate that the combination of compound and protein descriptors in the so-called proteochemometric models improves performance, with >80% of the models exhibiting higher ROC AUC and MCC values than per-target models only based on compound information. Hence, our investigations suggest that proteochemometric modeling is a successful approach for molecular glue degraders. The proposed machine learning strategies can aid compound prioritization based on recruitment efficacy and target selectivity, and thus have the potential to facilitate the design and discovery of therapeutic CRBN molecular glues.
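
Because this abstract, and several others on this page, report ROC AUC and MCC, a short illustration of how the two metrics are computed may help. This is a plain-Python sketch using the standard textbook definitions; it is not code from the paper, and the helper names are hypothetical.

```python
import math


def mcc(y_true: list[int], y_pred: list[int]) -> float:
    """Matthews correlation coefficient for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def roc_auc(y_true: list[int], scores: list[float]) -> float:
    """ROC AUC as the probability that a random positive outranks a
    random negative (ties count one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(mcc([1, 1, 0, 0], [1, 0, 0, 0]))
    print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

MCC uses all four cells of the confusion matrix, which makes it a common choice for primary screening data where actives are typically much rarer than inactives.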

Citations: 0
Statistical approaches enabling technology-specific assay interference prediction from large screening data sets
Pub Date : 2024-06-01 DOI: 10.1016/j.ailsci.2024.100099
Vincenzo Palmacci, Steffen Hirte, Jorge Enrique Hernández González, Floriane Montanari, Johannes Kirchmair

High throughput screening (HTS) technologies allow the biological testing of hundreds of thousands of compounds per day. Typically, a substantial proportion of the initial hits obtained by HTS are artifacts caused by assay interference. Therefore, global and technology-specific in silico models for identifying and predicting compounds interfering with biological assays have been developed. The global models benefit from training on large screening data sets, while the specialized models benefit from training on assay technology-specific experimental data. In this work, we develop and explore strategies for generating better predictors of technology-specific assay interference by utilizing the large bioactivity data matrices global models are trained on and employing partially new compound labeling approaches to maintain the assay technology awareness of specialized models. We demonstrate the utility of the statistically derived interference labels in machine learning using fluorescence-based assay interference as a representative example. Our random forest and multi-layer perceptron classifiers showed improved performance compared to existing models, achieving Matthews correlation coefficients (MCCs) of up to 0.47 on holdout data and up to 0.45 on an external test set. These results demonstrate that accurate assay-specific interference labels can be derived from large bioactivity data matrices, enabling the development of new machine-learning models without the need for further experimental data.

Citations: 0
Federated learning for predicting compound mechanism of action based on image-data from cell painting
Pub Date : 2024-05-09 DOI: 10.1016/j.ailsci.2024.100098
Li Ju, Andreas Hellander, Ola Spjuth

Having access to sufficient data is essential in order to train accurate machine learning models, but much data is not publicly available. In drug discovery this is particularly evident, as much data is withheld at pharmaceutical companies for various reasons. Federated Learning (FL) aims to train a joint model across multiple parties without disclosing data between the parties. In this work, we leverage Federated Learning to predict compound Mechanism of Action (MoA) using fluorescence image data from cell painting. Our study evaluates the effectiveness and efficiency of FL, comparing it with non-collaborative and data-sharing collaborative learning in diverse scenarios. Specifically, we investigate the impact of data heterogeneity across participants on MoA prediction, an essential concern in real-life applications of FL, and demonstrate the benefits for all involved parties. This work highlights the potential of federated learning in multi-institutional collaborative machine learning for drug discovery and assessment of chemicals, offering a promising avenue to overcome data-sharing constraints.
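
To make the idea of training a joint model without sharing data concrete, here is a minimal Python sketch of one aggregation round. The abstract does not say which aggregation rule the authors used; this sketch assumes the widely used federated averaging (FedAvg) rule, in which each party sends only its locally trained parameters and its sample count. The function name and the flat-list parameter representation are illustrative assumptions.

```python
def federated_average(
    client_weights: list[list[float]],
    client_sizes: list[int],
) -> list[float]:
    """Average client parameter vectors, weighted by local sample count.

    client_weights[k] holds the parameters trained by party k on its own
    private data; only these parameters, never the data, reach the server.
    """
    if len(client_weights) != len(client_sizes):
        raise ValueError("one sample count is required per client")
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
        for j in range(n_params)
    ]


if __name__ == "__main__":
    # Two hypothetical parties with 1 and 3 local samples respectively.
    print(federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

Weighting by sample count means a party with more data pulls the joint model further toward its local solution, which is one place where the data heterogeneity discussed in the abstract can matter.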

Citations: 0
An integrated approach to predict activators of NRF2 - the transcription factor for oxidative stress response
Pub Date : 2024-04-13 DOI: 10.1016/j.ailsci.2024.100097
Yaroslav Chushak, Rebecca A. Clewell

A variety of environmental and physiological conditions can cause oxidative stress that damages cellular components such as DNA, proteins and lipids. Oxidative stress is implicated in many human diseases including cancer, cardiovascular diseases, neurological diseases, inflammatory diseases, and aging. The nuclear factor erythroid 2–related factor 2 (NRF2) is a transcription factor that plays a key role in the cellular antioxidant defense system as it regulates transcription of antioxidant proteins and detoxifying enzymes. There is an urgent need to identify novel compounds that activate NRF2 and enhance antioxidant defense. We collected data from the high-throughput screening of NRF2 activators and identified molecular fragments (structural alerts) associated with the activation of NRF2. We also developed ten classification models using different types of molecular descriptors and machine learning techniques. Two approaches were used to establish the applicability domain of the developed models: the structure-based approach and the distance-to-model approach. The best-performing model, which used a message passing neural network (MPNN), showed an accuracy of 87% for test-set chemicals within a distance to model of 0.3. The integrative approach, combining the generated structural alerts with the MPNN model, was used to screen approved drugs collected in DrugBank to identify potential NRF2 activators. Out of 2393 screened chemicals, 138 compounds were predicted as NRF2 activators by both approaches. Analysis of these compounds showed that some drugs were already known activators of NRF2 while others are potentially novel activators.
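
The integrative screen described above keeps only compounds flagged by both the structural alerts and the model. A minimal Python sketch of that consensus logic follows. Only the 0.3 distance cutoff comes from the abstract; the record fields, the 0.5 probability cutoff, and the decision to apply the distance filter during screening are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScreeningRecord:
    """Hypothetical per-compound screening result."""
    name: str
    has_structural_alert: bool   # matched at least one alert fragment
    model_probability: float     # model score for "NRF2 activator"
    distance_to_model: float     # applicability-domain distance


def consensus_activators(
    records: list[ScreeningRecord],
    max_distance: float = 0.3,      # cutoff quoted in the abstract
    min_probability: float = 0.5,   # assumed decision threshold
) -> list[str]:
    """Return names of compounds predicted active by both approaches
    and lying inside the assumed applicability domain."""
    return [
        r.name
        for r in records
        if r.has_structural_alert
        and r.model_probability >= min_probability
        and r.distance_to_model <= max_distance
    ]


if __name__ == "__main__":
    demo = [
        ScreeningRecord("cmpd_a", True, 0.91, 0.12),
        ScreeningRecord("cmpd_b", True, 0.80, 0.45),   # outside domain
        ScreeningRecord("cmpd_c", False, 0.95, 0.10),  # no alert
        ScreeningRecord("cmpd_d", True, 0.30, 0.05),   # model says inactive
    ]
    print(consensus_activators(demo))
```

Requiring agreement between two independent approaches trades recall for precision, which suits a prioritization list intended for experimental follow-up.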

Citations: 0
Artificial intelligence-open science symbiosis in chemoinformatics
Pub Date : 2024-03-21 DOI: 10.1016/j.ailsci.2024.100096
Filip Miljković, José L. Medina-Franco

In chemoinformatics, artificial intelligence (AI) continues to grow a symbiosis with open science (OS). Such a close AI-OS interaction brings substantial practical benefits in research, scientific dissemination, and education, to name a few areas. The AI-OS symbiosis can be further enhanced by combining sufficient substantive expertise, mathematical and statistical knowledge, and coding skills. This Viewpoint discusses the benefits of the smooth and productive interaction between AI, OS, and open data. We also present a short list of misconceptions and pitfalls surrounding AI-OS and propose correct responses and behaviors agreed upon by field experts. In addition, we provide suggestions to continue enhancing the positive contributions of the AI-OS symbiosis towards chemoinformatics.

Citations: 0
Rationalism in the face of GPT hypes: Benchmarking the output of large language models against human expert-curated biomedical knowledge graphs
Pub Date : 2024-02-01 DOI: 10.1016/j.ailsci.2024.100095
Negin Sadat Babaiha, Sathvik Guru Rao, Jürgen Klein, Bruce Schultz, Marc Jacobs, Martin Hofmann-Apitius

Biomedical knowledge graphs (KGs) hold valuable information regarding biomedical entities such as genes, diseases, biological processes, and drugs. KGs have been successfully employed in challenging biomedical areas such as the identification of pathophysiology mechanisms or drug repurposing. The creation of high-quality KGs typically requires labor-intensive multi-database integration or substantial human expert curation, both of which take time and contribute to the workload of data processing and annotation. Therefore, the use of automatic systems for KG building and maintenance is a prerequisite for the wide uptake and utilization of KGs. Technologies supporting the automated generation and updating of KGs typically make use of Natural Language Processing (NLP), which is optimized for extracting implicit triples described in relevant biomedical text sources. At the core of this challenge is how to improve the accuracy and coverage of the information extraction module by utilizing different models and tools. The emergence of pre-trained large language models (LLMs), such as ChatGPT, which has grown dramatically in popularity, has revolutionized the field of NLP, making them potential candidates for text-based graph creation as well. So far, no previous work has investigated the power of LLMs on the generation of cause-and-effect networks and KGs encoded in Biological Expression Language (BEL). In this paper, we present initial studies towards one-shot BEL relation extraction using two different versions of the Generative Pre-trained Transformer (GPT) models and evaluate their performance by comparing the extracted results to a highly accurate BEL KG manually curated by domain experts.
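
Comparing model-extracted relations against a curated graph can be pictured as a set comparison over (subject, relation, object) triples. The Python sketch below shows one plausible way to score such a comparison with precision, recall, and F1. The abstract does not state which metrics or matching rules the authors used, so exact matching after lowercasing, the function names, and the example triples are all assumptions.

```python
Triple = tuple[str, str, str]


def normalize(triple: Triple) -> Triple:
    """Lowercase and trim each element so trivial differences do not
    count as mismatches (an assumed, deliberately simple rule)."""
    subject, relation, obj = triple
    return (subject.strip().lower(), relation.strip().lower(), obj.strip().lower())


def score_triples(
    extracted: list[Triple],
    curated: list[Triple],
) -> tuple[float, float, float]:
    """Return (precision, recall, F1) of extracted triples against a
    curated reference set, using exact match after normalization."""
    predicted = {normalize(t) for t in extracted}
    reference = {normalize(t) for t in curated}
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical triples; real BEL statements use namespaced terms.
    model_output = [("A", "increases", "B"), ("C", "decreases", "D")]
    gold = [("a", "increases", "b"), ("E", "increases", "F")]
    print(score_triples(model_output, gold))
```

Exact matching is strict: a triple that names the right entities with a synonym or a different namespace is counted as wrong, so a real evaluation would likely need entity normalization before comparison.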

Citations: 0
Origins and progression of the polypharmacology concept in drug discovery
Pub Date: 2024-01-03 DOI: 10.1016/j.ailsci.2024.100094
Jürgen Bajorath
Citations: 0
Potential inconsistencies or artifacts in deriving and interpreting deep learning models and key criteria for scientifically sound applications in the life sciences
Pub Date: 2023-12-11 DOI: 10.1016/j.ailsci.2023.100093
Jürgen Bajorath
Citations: 0
Yoked learning in molecular data science
Pub Date: 2023-12-02 DOI: 10.1016/j.ailsci.2023.100089
Zhixiong Li, Yan Xiang, Yujing Wen, Daniel Reker

Active machine learning is an established and increasingly popular experimental design technique in which the machine learning model can request additional data to improve its predictive performance. It is generally assumed that the requested data are optimal only for the model that selected them, since the selection relies on that model's predictions or architecture, and that they therefore cannot be transferred to other models. Inspired by research in pedagogy, we here introduce the concept of yoked machine learning, in which a second machine learning model learns from the data selected by another model. We found that in 48% of the benchmarked combinations, yoked learning performed similarly to or better than active learning. We analyze distinct cases in which yoked learning can improve active learning performance. In particular, we prototype yoked deep learning (YoDeL), in which a classic machine learning model provides data to a deep neural network, thereby mitigating challenges of active deep learning such as slow refitting time per learning iteration and poor performance on small datasets. In summary, we expect the new concept of yoked (deep) learning to provide a competitive option to boost the performance of active learning and to benefit from the distinct capabilities of multiple machine learning models during data acquisition, training, and deployment.
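
The yoked setup described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the 1-D grid pool, the threshold `oracle`, the nearest-centroid "teacher", the 1-nearest-neighbour "student", and the margin-based acquisition rule are all assumptions chosen only to keep the example small and dependency-free. The teacher alone decides which points get labelled; the student is then fitted on exactly those points.

```python
def fit_centroids(X, y):
    """Teacher model (assumed): 1-D nearest-centroid classifier, returns class means."""
    means = {}
    for label in (0, 1):
        points = [x for x, t in zip(X, y) if t == label]
        means[label] = sum(points) / len(points)
    return means


def teacher_margin(means, x):
    """Difference in distance to the two centroids; a small value means uncertain."""
    return abs(abs(x - means[0]) - abs(x - means[1]))


def select_with_teacher(pool, oracle, seed_idx, n_queries):
    """Active-learning loop in which only the teacher chooses what to label."""
    labelled = list(seed_idx)  # the seed must contain both classes
    for _ in range(n_queries):
        X = [pool[i] for i in labelled]
        y = [oracle(x) for x in X]
        means = fit_centroids(X, y)
        candidates = [i for i in range(len(pool)) if i not in labelled]
        # Query the unlabelled point the teacher is least sure about.
        labelled.append(min(candidates, key=lambda i: teacher_margin(means, pool[i])))
    X = [pool[i] for i in labelled]
    return X, [oracle(x) for x in X]


def student_predict(X, y, x):
    """Student model (assumed): 1-nearest-neighbour on the teacher-selected data."""
    nearest = min(range(len(X)), key=lambda j: abs(X[j] - x))
    return y[nearest]


# Hypothetical toy problem: a grid on [-1, 1] with a hidden class boundary at 0.2.
pool = [(i - 100) / 100 for i in range(201)]
oracle = lambda x: int(x > 0.2)  # stands in for running the labelling experiment

X_sel, y_sel = select_with_teacher(pool, oracle, seed_idx=[0, 200], n_queries=5)
accuracy = sum(student_predict(X_sel, y_sel, x) == oracle(x) for x in pool) / len(pool)
print(len(X_sel), round(accuracy, 3))
```

In this sketch the student never influences which points are queried, which is the defining property of the yoked arrangement; in YoDeL the student's role would be taken by a deep neural network and the teacher's by a classic machine learning model.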

Citations: 0