Optimizing active learning for free energy calculations

James Thompson, W Patrick Walters, Jianwen A Feng, Nicolas A Pabon, Hongcheng Xu, Michael Maser, Brian B Goldman, Demetri Moustakas, Molly Schmidt, Forrest York
DOI: 10.1016/j.ailsci.2022.100050
Journal: Artificial intelligence in the life sciences, volume 2, Article 100050
Published: 2022-12-01
Full text: https://www.sciencedirect.com/science/article/pii/S2667318522000204
Citations: 10

Abstract

While Relative Binding Free Energy (RBFE) calculations have become a mainstay in lead optimization programs, their computational expense has limited their broader application. Active learning (AL), a machine learning method used to direct a search iteratively, has been used to explore larger chemical libraries with RBFE calculations. Although AL has been applied successfully, there has been no systematic study of how parameter settings affect its performance. To address this gap, we generated an exhaustive dataset of RBFE calculations on 10,000 congeneric molecules. We used this dataset to explore the impact of several AL design choices, including the number of molecules sampled at each iteration, the method used to select an initial sample, the method used to build a machine learning model, and the acquisition function that defines the balance between exploration and exploitation in the search. Our studies demonstrated that the performance of AL is largely insensitive to the specific machine learning method and acquisition function used. The most significant factor impacting performance was the number of molecules sampled at each iteration: selecting too few molecules hurts performance. Under the best conditions, we were able to identify 75% of the 100 top-scoring molecules by sampling only 6% of the dataset. We hope that the dataset of 10K molecules will provide the basis for future studies exploring additional AL strategies. The source code and supporting data for the work are available at https://github.com/google-research/google-research/tree/master/al_for_fep.
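
The design choices listed in the abstract (batch size per iteration, initial sample, model, acquisition function) and its headline metric (fraction of the top 100 molecules recovered after sampling 6% of 10,000) can be illustrated with a minimal sketch. This is not the authors' implementation and does not use the code in the linked repository. Everything here is an assumption made for illustration: the synthetic library, the ridge-regression model, the uniform random initial sample, the greedy acquisition function, the convention that lower scores are better, and all function names.

```python
import random

def make_synthetic_library(n=10_000, d=8, noise=0.5, seed=1):
    """Hypothetical stand-in for a scored library: random features, noisy linear scores."""
    rng = random.Random(seed)
    true_w = [rng.gauss(0, 1) for _ in range(d)]
    features = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    scores = [sum(a * b for a, b in zip(true_w, x)) + rng.gauss(0, noise)
              for x in features]
    return features, scores

def fit_ridge(rows, targets, lam=1.0):
    """Ridge regression (bias included and also regularized) via Gauss-Jordan elimination."""
    d = len(rows[0]) + 1  # +1 for the bias term
    # Augmented matrix for the normal equations (X^T X + lam*I) w = X^T y.
    A = [[lam if i == j else 0.0 for j in range(d)] + [0.0] for i in range(d)]
    for x, y in zip(rows, targets):
        xb = list(x) + [1.0]
        for i in range(d):
            for j in range(d):
                A[i][j] += xb[i] * xb[j]
            A[i][d] += xb[i] * y
    for c in range(d):
        p = max(range(c, d), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(d):
            if r != c:
                f = A[r][c] / A[c][c]
                for k in range(c, d + 1):
                    A[r][k] -= f * A[c][k]
    return [A[i][d] / A[i][i] for i in range(d)]

def predict(w, x):
    """Linear prediction; the last weight is the bias."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w[-1]

def run_active_learning(features, oracle, batch_size, n_rounds, seed=0):
    """Random initial batch, then greedy batches chosen by the current model."""
    rng = random.Random(seed)
    n = len(features)
    selected = rng.sample(range(n), batch_size)
    labels = {i: oracle(i) for i in selected}  # oracle plays the role of the expensive calculation
    for _ in range(n_rounds - 1):
        w = fit_ridge([features[i] for i in selected],
                      [labels[i] for i in selected])
        pool = [i for i in range(n) if i not in labels]
        # Greedy acquisition: take the molecules with the lowest predicted score.
        pool.sort(key=lambda i: predict(w, features[i]))
        batch = pool[:batch_size]
        for i in batch:
            labels[i] = oracle(i)
        selected.extend(batch)
    return selected

def top_k_recall(selected, scores, k):
    """Fraction of the k lowest-scoring items that appear in `selected`."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return len(set(ranked[:k]) & set(selected)) / k

features, scores = make_synthetic_library()
# 10 rounds of 60 molecules = 600 evaluations, i.e. 6% of a 10,000-molecule library.
selected = run_active_learning(features, lambda i: scores[i],
                               batch_size=60, n_rounds=10)
recall = top_k_recall(selected, scores, k=100)
print(f"sampled {len(selected) / len(scores):.0%} of the library, "
      f"top-100 recall = {recall:.2f}")
```

Because the synthetic scores are nearly linear in the features, the recall printed here says nothing about real RBFE data; the sketch only shows where each design choice enters the loop. Changing `batch_size` while holding the total budget fixed is the kind of experiment the abstract reports as most influential.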

Source journal
Artificial intelligence in the life sciences
Subject areas: Pharmacology, Biochemistry, Genetics and Molecular Biology (General), Computer Science Applications, Health Informatics, Drug Discovery, Veterinary Science and Veterinary Medicine (General)
CiteScore: 5.00
Self-citation rate: 0.00%
Articles published: 0
Review time: 15 days