Mol-BERT: An Effective Molecular Representation with BERT for Molecular Property Prediction

Juncai Li, Xiaofei Jiang
{"title":"Mol-BERT: An Effective Molecular Representation with BERT for Molecular Property Prediction","authors":"Juncai Li, Xiaofei Jiang","doi":"10.1155/2021/7181815","DOIUrl":null,"url":null,"abstract":"Molecular property prediction is an essential task in drug discovery. Most computational approaches with deep learning techniques either focus on designing novel molecular representation or combining with some advanced models together. However, researchers pay fewer attention to the potential benefits in massive unlabeled molecular data (e.g., ZINC). This task becomes increasingly challenging owing to the limitation of the scale of labeled data. Motivated by the recent advancements of pretrained models in natural language processing, the drug molecule can be naturally viewed as language to some extent. In this paper, we investigate how to develop the pretrained model BERT to extract useful molecular substructure information for molecular property prediction. We present a novel end-to-end deep learning framework, named Mol-BERT, that combines an effective molecular representation with pretrained BERT model tailored for molecular property prediction. Specifically, a large-scale prediction BERT model is pretrained to generate the embedding of molecular substructures, by using four million unlabeled drug SMILES (i.e., ZINC 15 and ChEMBL 27). Then, the pretrained BERT model can be fine-tuned on various molecular property prediction tasks. To examine the performance of our proposed Mol-BERT, we conduct several experiments on 4 widely used molecular datasets. In comparison to the traditional and state-of-the-art baselines, the results illustrate that our proposed Mol-BERT can outperform the current sequence-based methods and achieve at least 2% improvement on ROC-AUC score on Tox21, SIDER, and ClinTox dataset.","PeriodicalId":23995,"journal":{"name":"Wirel. Commun. Mob. Comput.","volume":"26 1","pages":"7181815:1-7181815:7"},"PeriodicalIF":0.0000,"publicationDate":"2021-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"16","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Wirel. Commun. Mob. Comput.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1155/2021/7181815","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 16

Abstract

Molecular property prediction is an essential task in drug discovery. Most deep learning approaches either focus on designing novel molecular representations or on combining them with advanced models. However, researchers have paid less attention to the potential benefits of massive unlabeled molecular data (e.g., ZINC). The task becomes increasingly challenging owing to the limited scale of labeled data. Motivated by recent advances in pretrained models for natural language processing, we observe that a drug molecule can, to some extent, naturally be viewed as a language. In this paper, we investigate how to adapt the pretrained BERT model to extract useful molecular substructure information for molecular property prediction. We present a novel end-to-end deep learning framework, named Mol-BERT, that combines an effective molecular representation with a pretrained BERT model tailored for molecular property prediction. Specifically, a large-scale BERT model is pretrained to generate embeddings of molecular substructures, using four million unlabeled drug SMILES strings (from ZINC 15 and ChEMBL 27). The pretrained BERT model can then be fine-tuned on various molecular property prediction tasks. To examine the performance of the proposed Mol-BERT, we conduct experiments on four widely used molecular datasets. Compared with traditional and state-of-the-art baselines, the results show that Mol-BERT outperforms current sequence-based methods, achieving at least a 2% improvement in ROC-AUC score on the Tox21, SIDER, and ClinTox datasets.
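The abstract treats a molecule as a sentence of substructure "words". As a concrete illustration, the snippet below is a minimal sketch (not the authors' released code) that uses RDKit's Morgan algorithm to map each atom of a SMILES string to a local-substructure identifier, producing the kind of token sequence a BERT-style model could consume; the radius value is an illustrative assumption.

```python
# A minimal sketch (not the authors' released code): turning a SMILES string
# into a sequence of Morgan substructure identifiers that can serve as BERT
# "tokens". Requires RDKit; the radius value is an illustrative assumption.
from rdkit import Chem
from rdkit.Chem import AllChem

def smiles_to_substructure_tokens(smiles: str, radius: int = 1):
    """Map each atom to the Morgan identifier of its local substructure,
    yielding a token sequence analogous to words in a sentence."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # invalid SMILES
    bit_info = {}
    AllChem.GetMorganFingerprint(mol, radius, bitInfo=bit_info)
    # bit_info maps identifier -> ((atom_idx, env_radius), ...); invert it so
    # tokens follow atom order in the molecule.
    atom_tokens = {}
    for identifier, occurrences in bit_info.items():
        for atom_idx, env_radius in occurrences:
            if env_radius == radius:
                atom_tokens[atom_idx] = identifier
    return [atom_tokens[i] for i in sorted(atom_tokens)]

print(smiles_to_substructure_tokens("CCO"))  # ethanol -> three substructure ids
```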
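For the fine-tuning stage, a plausible setup is a standard BERT encoder with a sequence-classification head, as provided by the HuggingFace transformers library. This is again a hedged sketch: the vocabulary size, model dimensions, and binary label below are illustrative assumptions, not the paper's exact hyperparameters.

```python
# Hedged sketch of the fine-tuning stage: a BERT encoder over substructure-token
# sequences with a classification head. Vocabulary size, model dimensions, and
# the binary task are illustrative assumptions, not the paper's hyperparameters.
import torch
from transformers import BertConfig, BertForSequenceClassification

config = BertConfig(
    vocab_size=13000,        # assumed size of the substructure vocabulary
    hidden_size=256,
    num_hidden_layers=6,
    num_attention_heads=8,
    num_labels=2,            # e.g., toxic vs. non-toxic for one Tox21 task
)
model = BertForSequenceClassification(config)  # load pretrained weights here in practice

input_ids = torch.tensor([[101, 57, 412, 9981, 102]])  # [CLS] tok tok tok [SEP] (illustrative ids)
labels = torch.tensor([1])
out = model(input_ids=input_ids, labels=labels)
out.loss.backward()  # an optimizer step would complete one fine-tuning update
```

In practice the pretrained encoder weights would be loaded before fine-tuning, and performance would be reported as ROC-AUC on held-out data (e.g., via sklearn.metrics.roc_auc_score), matching the evaluation described above.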