Conformal prediction-based machine learning in Cheminformatics: Current applications and new challenges

Artificial intelligence in the life sciences (IF 5.4) | Pub Date: 2025-06-01 | Epub Date: 2025-02-08 | DOI: 10.1016/j.ailsci.2025.100127
Mario Astigarraga, Andrés Sánchez-Ruiz, Gonzalo Colmenarejo
Volume 7, Article 100127
https://www.sciencedirect.com/science/article/pii/S2667318525000030
Citations: 0

Abstract

Conformal Prediction (CP) is a distribution-free Machine Learning (ML) framework, developed over the last ∼25 years, that provides well-calibrated prediction subsets/intervals containing the true label with a user-defined probability, requiring only that the data be exchangeable. It is based on the concept of nonconformity (or dissimilarity) of a new prediction relative to previous data and their predictions, so that the prediction subset/interval is larger for new “unusual” instances and smaller for “typical” ones. Given its simplicity and ease of application, CP has been widely adopted in Cheminformatics since 2012, especially in Quantitative Structure-Activity Relationship (QSAR) modeling and Molecular Screening. This rapid popularization can be explained on three grounds: (a) it can handle the applicability domain (AD) issue of ML models, which is of great importance in Cheminformatics because of the immense size of chemical space; (b) it deals with the classification of heavily imbalanced datasets, typical in Molecular Screening; and (c) it quantifies compound-specific prediction uncertainties, which is especially useful because it allows gain-cost strategies to be implemented that accelerate drug discovery by reducing the number of compounds to test. This comprehensive review introduces the method, provides a full appraisal of the work done in Cheminformatics (with special emphasis on QSAR and Molecular Screening), and discusses its pros and cons and new challenges, especially for Deep Learning applications and nonexchangeable datasets, a very frequent situation in Cheminformatics.
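
As a rough illustration of the nonconformity idea described above, the following Python sketch builds prediction sets for a two-class (inactive/active) screening problem using a class-conditional ("Mondrian") inductive conformal predictor. It is only a sketch under stated assumptions, not the method of any particular study covered by the review: the function name mondrian_prediction_sets, the nonconformity score (one minus the predicted class probability), the toy calibration data and the significance level epsilon are all hypothetical choices made for this example.

```python
import numpy as np


def mondrian_prediction_sets(cal_probs, cal_labels, new_probs, epsilon=0.25):
    """Class-conditional inductive conformal classifier (illustrative sketch).

    cal_probs  : (n_cal, n_classes) predicted class probabilities for a
                 held-out calibration set (from any underlying model).
    cal_labels : (n_cal,) true class indices of the calibration set.
    new_probs  : (n_new, n_classes) predicted probabilities for new compounds.
    epsilon    : significance level; the target confidence is 1 - epsilon.
    """
    cal_probs = np.asarray(cal_probs, dtype=float)
    cal_labels = np.asarray(cal_labels, dtype=int)
    new_probs = np.asarray(new_probs, dtype=float)
    n_classes = cal_probs.shape[1]

    # Assumed nonconformity score: 1 - probability given to the true class.
    cal_scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]

    prediction_sets = []
    for probs in new_probs:
        labels = []
        for c in range(n_classes):
            # Compare only against calibration examples of the same class,
            # so each class is calibrated separately (Mondrian approach).
            class_scores = cal_scores[cal_labels == c]
            new_score = 1.0 - probs[c]
            # p-value: share of same-class calibration scores that are at
            # least as nonconforming as the candidate label (with +1 terms).
            p_value = (np.sum(class_scores >= new_score) + 1) / (len(class_scores) + 1)
            if p_value > epsilon:
                labels.append(c)
        prediction_sets.append(labels)
    return prediction_sets


# Hypothetical, imbalanced calibration data: 8 inactives (0), 4 actives (1).
p_inactive = [0.9, 0.8, 0.85, 0.7, 0.95, 0.6, 0.75, 0.9]  # P(class 0) for inactives
p_active = [0.8, 0.6, 0.7, 0.4]                           # P(class 1) for actives
cal_probs = [[p, 1 - p] for p in p_inactive] + [[1 - p, p] for p in p_active]
cal_labels = [0] * 8 + [1] * 4

# Three hypothetical new compounds with their predicted probabilities.
new_probs = [[0.92, 0.08], [0.50, 0.50], [0.65, 0.35]]

print(mondrian_prediction_sets(cal_probs, cal_labels, new_probs, epsilon=0.25))
print(mondrian_prediction_sets(cal_probs, cal_labels, new_probs, epsilon=0.10))
```

With epsilon = 0.25 the three hypothetical compounds receive the sets [0], [1] and [] respectively; the empty set flags a compound that looks unusual for both classes. Lowering epsilon to 0.1, that is, asking for higher confidence, widens every set to [0, 1]. Calibrating each class separately is the design choice that keeps the minority "active" class from being swamped by the majority class.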
Source journal
Artificial intelligence in the life sciences
Subject areas: Pharmacology, Biochemistry, Genetics and Molecular Biology (General), Computer Science Applications, Health Informatics, Drug Discovery, Veterinary Science and Veterinary Medicine (General)
CiteScore: 5.00
Self-citation rate: 0.00%
Number of articles: 0
Review time: 15 days