A systematic review of deep learning chemical language models in recent era

IF 7.1 2区 化学 Q1 CHEMISTRY, MULTIDISCIPLINARY Journal of Cheminformatics Pub Date : 2024-11-18 DOI:10.1186/s13321-024-00916-y
Hector Flores-Hernandez, Emmanuel Martinez-Ledesma
{"title":"A systematic review of deep learning chemical language models in recent era","authors":"Hector Flores-Hernandez,&nbsp;Emmanuel Martinez-Ledesma","doi":"10.1186/s13321-024-00916-y","DOIUrl":null,"url":null,"abstract":"<div><p>Discovering new chemical compounds with specific properties can provide advantages for fields that rely on materials for their development, although this task comes at a high cost in terms of complexity and resources. Since the beginning of the data age, deep learning techniques have revolutionized the process of designing molecules by analyzing and learning from representations of molecular data, greatly reducing the resources and time involved. Various deep learning approaches have been developed to date, using a variety of architectures and strategies, in order to explore the extensive and discontinuous chemical space, providing benefits for generating compounds with specific properties. In this study, we present a systematic review that offers a statistical description and comparison of the strategies utilized to generate molecules through deep learning techniques, utilizing the metrics proposed in Molecular Sets (MOSES) or Guacamol. The study included 48 articles retrieved from a query-based search of Scopus and Web of Science and 25 articles retrieved from citation search, yielding a total of 72 retrieved articles, of which 62 correspond to chemical language models approaches to molecule generation and other 10 retrieved articles correspond to molecular graph representations. Transformers, recurrent neural networks (RNNs), generative adversarial networks (GANs), Structured Space State Sequence (S4) models, and variational autoencoders (VAEs) are considered the main deep learning architectures used for molecule generation in the set of retrieved articles. In addition, transfer learning, reinforcement learning, and conditional learning are the most employed techniques for biased model generation and exploration of specific chemical space regions. Finally, this analysis focuses on the central themes of molecular representation, databases, training dataset size, validity-novelty trade-off, and performance of unbiased and biased chemical language models. These themes were selected to conduct a statistical analysis utilizing graphical representation and statistical tests. The resulting analysis reveals the main challenges, advantages, and opportunities in the field of chemical language models over the past four years.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":7.1000,"publicationDate":"2024-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00916-y","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Cheminformatics","FirstCategoryId":"92","ListUrlMain":"https://link.springer.com/article/10.1186/s13321-024-00916-y","RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"CHEMISTRY, MULTIDISCIPLINARY","Score":null,"Total":0}
引用次数: 0

Abstract

Discovering new chemical compounds with specific properties can provide advantages for fields that rely on materials for their development, although this task comes at a high cost in terms of complexity and resources. Since the beginning of the data age, deep learning techniques have revolutionized the process of designing molecules by analyzing and learning from representations of molecular data, greatly reducing the resources and time involved. Various deep learning approaches have been developed to date, using a variety of architectures and strategies, in order to explore the extensive and discontinuous chemical space, providing benefits for generating compounds with specific properties. In this study, we present a systematic review that offers a statistical description and comparison of the strategies utilized to generate molecules through deep learning techniques, utilizing the metrics proposed in Molecular Sets (MOSES) or Guacamol. The study included 48 articles retrieved from a query-based search of Scopus and Web of Science and 25 articles retrieved from citation search, yielding a total of 72 retrieved articles, of which 62 correspond to chemical language models approaches to molecule generation and other 10 retrieved articles correspond to molecular graph representations. Transformers, recurrent neural networks (RNNs), generative adversarial networks (GANs), Structured Space State Sequence (S4) models, and variational autoencoders (VAEs) are considered the main deep learning architectures used for molecule generation in the set of retrieved articles. In addition, transfer learning, reinforcement learning, and conditional learning are the most employed techniques for biased model generation and exploration of specific chemical space regions. Finally, this analysis focuses on the central themes of molecular representation, databases, training dataset size, validity-novelty trade-off, and performance of unbiased and biased chemical language models. These themes were selected to conduct a statistical analysis utilizing graphical representation and statistical tests. The resulting analysis reveals the main challenges, advantages, and opportunities in the field of chemical language models over the past four years.

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
近代深度学习化学语言模型的系统回顾。
发现具有特定性质的新化合物可以为依赖材料进行开发的领域带来优势,尽管这项任务在复杂性和资源方面需要付出高昂的代价。自数据时代开始以来,深度学习技术通过分析和学习分子数据表示,彻底改变了分子设计过程,大大减少了所需的资源和时间。迄今为止,已经开发出了多种深度学习方法,采用了各种架构和策略,以探索广泛而不连续的化学空间,为生成具有特定性质的化合物提供益处。在本研究中,我们利用分子集(MOSES)或 Guacamol 中提出的指标,对通过深度学习技术生成分子的策略进行了统计描述和比较,并提交了一篇系统性综述。这项研究包括从 Scopus 和 Web of Science 的查询式搜索中检索到的 48 篇文章,以及从引文搜索中检索到的 25 篇文章,共检索到 72 篇文章,其中 62 篇与分子生成的化学语言模型方法相对应,另外 10 篇检索到的文章与分子图表示方法相对应。变换器、递归神经网络(RNN)、生成对抗网络(GAN)、结构空间状态序列(S4)模型和变异自动编码器(VAE)被认为是检索文章中用于分子生成的主要深度学习架构。此外,迁移学习、强化学习和条件学习也是最常用的技术,用于生成有偏差的模型和探索特定的化学空间区域。最后,本分析侧重于分子表征、数据库、训练数据集规模、有效性-新颖性权衡以及无偏和有偏化学语言模型的性能等中心主题。选定这些主题后,利用图形表示法和统计检验法进行了统计分析。分析结果揭示了过去四年中化学语言模型领域的主要挑战、优势和机遇。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
Journal of Cheminformatics
Journal of Cheminformatics CHEMISTRY, MULTIDISCIPLINARY-COMPUTER SCIENCE, INFORMATION SYSTEMS
CiteScore
14.10
自引率
7.00%
发文量
82
审稿时长
3 months
期刊介绍: Journal of Cheminformatics is an open access journal publishing original peer-reviewed research in all aspects of cheminformatics and molecular modelling. Coverage includes, but is not limited to: chemical information systems, software and databases, and molecular modelling, chemical structure representations and their use in structure, substructure, and similarity searching of chemical substance and chemical reaction databases, computer and molecular graphics, computer-aided molecular design, expert systems, QSAR, and data mining techniques.
期刊最新文献
A systematic review of deep learning chemical language models in recent era Accelerated hit identification with target evaluation, deep learning and automated labs: prospective validation in IRAK1 QSPRpred: a Flexible Open-Source Quantitative Structure-Property Relationship Modelling Tool Comparative evaluation of methods for the prediction of protein–ligand binding sites Protein-small molecule binding site prediction based on a pre-trained protein language model with contrastive learning
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1