Exploring the Potential of Adaptive, Local Machine Learning in Comparison to the Prediction Performance of Global Models: A Case Study from Bayer's Caco-2 Permeability Database.

Journal of Chemical Information and Modeling (IF 5.6; Zone 2, Chemistry; JCR Q1, Chemistry, Medicinal) Pub Date: 2024-11-20 DOI: 10.1021/acs.jcim.4c01083
Frank Filip Steinbauer, Thorsten Lehr, Andreas Reichel
Citations: 0

Abstract

Machine learning (ML) techniques are being widely implemented to fill the gap in simple molecular design guidelines for newer therapeutic modalities in the extended and beyond rule of five chemical space (eRo5, bRo5). These ML techniques predict molecular properties directly from the structure, allowing for the prioritization of promising compounds. However, model performance varies greatly among ML use cases. Caco-2 permeability is a molecular property for which sufficient performance of generalizing global models remains difficult to achieve. Especially within the lower permeability ranges, which are specific to larger molecules belonging to the e/bRo5 space, accurate regression predictions have proven challenging. The present study therefore identifies a suitable combination of ML algorithm and descriptors, consisting of the LightGBM algorithm and RDKit molecular property descriptors, to predict Caco-2 permeability very efficiently with a simple global model. An additionally introduced local model uses the same algorithm and descriptors but selects its training data based on Tanimoto fingerprint similarity to match the individual test compound's structure. Evaluation of this adaptive model, by systematically varying the number of most similar structures used for training, shows only marginally improved performance over the global model, and only with specific training data constellations. These random improvements indicate that general rules for local model parametrization cannot be derived a priori for the chosen algorithm and descriptor combination. Preselecting training data does not seem advantageous over global ML based on all available data, although the creation of more data-efficient models was generally shown to be possible.
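
The local-model idea described in the abstract, picking the k training compounds most similar to each test compound before fitting, can be illustrated with a minimal sketch. This is not the authors' implementation. Fingerprints are represented here as plain Python sets of on-bit indices (an assumption made to keep the example dependency-free), the toy data are hypothetical, and a simple mean predictor stands in for the LightGBM model with RDKit descriptors named in the abstract. All function names are illustrative.

```python
from typing import Callable, FrozenSet, List, Sequence

# Toy stand-in for a structural fingerprint: the set of "on" bit indices.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity: shared on-bits divided by all on-bits in either set."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def select_local_training_set(
    query: Fingerprint, pool_fps: Sequence[Fingerprint], k: int
) -> List[int]:
    """Return the indices of the k pool compounds most similar to the query."""
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(
        range(len(pool_fps)),
        key=lambda i: tanimoto(query, pool_fps[i]),
        reverse=True,
    )
    return ranked[:k]


def local_predict(
    query: Fingerprint,
    pool_fps: Sequence[Fingerprint],
    k: int,
    fit_predict: Callable[[List[int]], float],
) -> float:
    """Fit on the query's k nearest neighbours only and predict for the query.

    fit_predict is a hypothetical hook: it receives the selected training
    indices and returns a prediction for the query compound.
    """
    return fit_predict(select_local_training_set(query, pool_fps, k))


# Hypothetical toy data: four training compounds with made-up target values.
POOL_FPS = [
    frozenset({1, 2, 3}),
    frozenset({1, 2, 4}),
    frozenset({7, 8, 9}),
    frozenset({1, 2, 3, 4}),
]
POOL_Y = [1.0, 2.0, 10.0, 3.0]
QUERY = frozenset({1, 2, 3})


def mean_baseline(train_idx: List[int]) -> float:
    """Placeholder learner: predicts the mean target of the selected compounds."""
    return sum(POOL_Y[i] for i in train_idx) / len(train_idx)


if __name__ == "__main__":
    # Vary the number of most similar structures; k = len(POOL_FPS) uses all data.
    for k in (1, 2, 4):
        print(k, local_predict(QUERY, POOL_FPS, k, mean_baseline))
```

Setting k to the size of the full training pool recovers the global model, so sweeping k compares local and global training within one code path. A real learner would replace `mean_baseline` by fitting on the descriptors of the selected compounds.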

Source journal
CiteScore: 9.80
Self-citation rate: 10.70%
Articles published: 529
Review time: 1.4 months
About the journal: The Journal of Chemical Information and Modeling publishes papers reporting new methodology and/or important applications in the fields of chemical informatics and molecular modeling. Specific topics include the representation and computer-based searching of chemical databases, molecular modeling, computer-aided molecular design of new materials, catalysts, or ligands, development of new computational methods or efficient algorithms for chemical software, and biopharmaceutical chemistry including analyses of biological activity and other issues related to drug discovery. Astute chemists, computer scientists, and information specialists look to this monthly’s insightful research studies, programming innovations, and software reviews to keep current with advances in this integral, multidisciplinary field. As a subscriber you’ll stay abreast of database search systems, use of graph theory in chemical problems, substructure search systems, pattern recognition and clustering, analysis of chemical and physical data, molecular modeling, graphics and natural language interfaces, bibliometric and citation analysis, and synthesis design and reactions databases.