Mol-BERT: An Effective Molecular Representation with BERT for Molecular Property Prediction

Juncai Li, Xiaofei Jiang
{"title":"Mol-BERT: An Effective Molecular Representation with BERT for Molecular Property Prediction","authors":"Juncai Li, Xiaofei Jiang","doi":"10.1155/2021/7181815","DOIUrl":null,"url":null,"abstract":"Molecular property prediction is an essential task in drug discovery. Most computational approaches with deep learning techniques either focus on designing novel molecular representation or combining with some advanced models together. However, researchers pay fewer attention to the potential benefits in massive unlabeled molecular data (e.g., ZINC). This task becomes increasingly challenging owing to the limitation of the scale of labeled data. Motivated by the recent advancements of pretrained models in natural language processing, the drug molecule can be naturally viewed as language to some extent. In this paper, we investigate how to develop the pretrained model BERT to extract useful molecular substructure information for molecular property prediction. We present a novel end-to-end deep learning framework, named Mol-BERT, that combines an effective molecular representation with pretrained BERT model tailored for molecular property prediction. Specifically, a large-scale prediction BERT model is pretrained to generate the embedding of molecular substructures, by using four million unlabeled drug SMILES (i.e., ZINC 15 and ChEMBL 27). Then, the pretrained BERT model can be fine-tuned on various molecular property prediction tasks. To examine the performance of our proposed Mol-BERT, we conduct several experiments on 4 widely used molecular datasets. In comparison to the traditional and state-of-the-art baselines, the results illustrate that our proposed Mol-BERT can outperform the current sequence-based methods and achieve at least 2% improvement on ROC-AUC score on Tox21, SIDER, and ClinTox dataset.","PeriodicalId":23995,"journal":{"name":"Wirel. Commun. Mob. Comput.","volume":"26 1","pages":"7181815:1-7181815:7"},"PeriodicalIF":0.0000,"publicationDate":"2021-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"16","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Wirel. Commun. Mob. Comput.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1155/2021/7181815","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 16

Abstract

Molecular property prediction is an essential task in drug discovery. Most deep learning approaches either focus on designing novel molecular representations or on combining them with advanced models. However, researchers have paid less attention to the potential benefits of massive unlabeled molecular data (e.g., ZINC). The task becomes increasingly challenging owing to the limited scale of labeled data. Motivated by recent advances in pretrained models for natural language processing, we observe that a drug molecule can, to some extent, naturally be viewed as a language. In this paper, we investigate how to adapt the pretrained BERT model to extract useful molecular substructure information for molecular property prediction. We present a novel end-to-end deep learning framework, named Mol-BERT, that combines an effective molecular representation with a pretrained BERT model tailored for molecular property prediction. Specifically, a large-scale BERT model is pretrained to generate embeddings of molecular substructures, using four million unlabeled drug SMILES strings (from ZINC 15 and ChEMBL 27). The pretrained BERT model can then be fine-tuned on various molecular property prediction tasks. To examine the performance of the proposed Mol-BERT, we conduct experiments on four widely used molecular datasets. Compared with traditional and state-of-the-art baselines, the results show that Mol-BERT outperforms current sequence-based methods, achieving at least a 2% improvement in ROC-AUC score on the Tox21, SIDER, and ClinTox datasets.
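The abstract treats a molecule as a sentence of substructure "words". As a concrete illustration, the snippet below is a minimal sketch (not the authors' released code) that uses RDKit's Morgan algorithm to map each atom of a SMILES string to a local-substructure identifier, producing the kind of token sequence a BERT-style model could consume; the radius value is an illustrative assumption.

```python
# A minimal sketch (not the authors' released code): turning a SMILES string
# into a sequence of Morgan substructure identifiers that can serve as BERT
# "tokens". Requires RDKit; the radius value is an illustrative assumption.
from rdkit import Chem
from rdkit.Chem import AllChem

def smiles_to_substructure_tokens(smiles: str, radius: int = 1):
    """Map each atom to the Morgan identifier of its local substructure,
    yielding a token sequence analogous to words in a sentence."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # invalid SMILES
    bit_info = {}
    AllChem.GetMorganFingerprint(mol, radius, bitInfo=bit_info)
    # bit_info maps identifier -> ((atom_idx, env_radius), ...); invert it so
    # tokens follow atom order in the molecule.
    atom_tokens = {}
    for identifier, occurrences in bit_info.items():
        for atom_idx, env_radius in occurrences:
            if env_radius == radius:
                atom_tokens[atom_idx] = identifier
    return [atom_tokens[i] for i in sorted(atom_tokens)]

print(smiles_to_substructure_tokens("CCO"))  # ethanol -> three substructure ids
```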
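For the fine-tuning stage, a plausible setup is a standard BERT encoder with a sequence-classification head, as provided by the HuggingFace transformers library. This is again a hedged sketch: the vocabulary size, model dimensions, and binary label below are illustrative assumptions, not the paper's exact hyperparameters.

```python
# Hedged sketch of the fine-tuning stage: a BERT encoder over substructure-token
# sequences with a classification head. Vocabulary size, model dimensions, and
# the binary task are illustrative assumptions, not the paper's hyperparameters.
import torch
from transformers import BertConfig, BertForSequenceClassification

config = BertConfig(
    vocab_size=13000,        # assumed size of the substructure vocabulary
    hidden_size=256,
    num_hidden_layers=6,
    num_attention_heads=8,
    num_labels=2,            # e.g., toxic vs. non-toxic for one Tox21 task
)
model = BertForSequenceClassification(config)  # load pretrained weights here in practice

input_ids = torch.tensor([[101, 57, 412, 9981, 102]])  # [CLS] tok tok tok [SEP] (illustrative ids)
labels = torch.tensor([1])
out = model(input_ids=input_ids, labels=labels)
out.loss.backward()  # an optimizer step would complete one fine-tuning update
```

In practice the pretrained encoder weights would be loaded before fine-tuning, and performance would be reported as ROC-AUC on held-out data (e.g., via sklearn.metrics.roc_auc_score), matching the evaluation described above.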