GCLmf: A Novel Molecular Graph Contrastive Learning Framework Based on Hard Negatives and Application in Toxicity Prediction.

IF 2.8 4区医学 Q3 CHEMISTRY, MEDICINAL Molecular Informatics Pub Date : 2024-10-18 DOI:10.1002/minf.202400169

Xinxin Yu, Yuanting Chen, Long Chen, Weihua Li, Yuhao Wang, Yun Tang, Guixia Liu

{"title":"GCLmf: A Novel Molecular Graph Contrastive Learning Framework Based on Hard Negatives and Application in Toxicity Prediction.","authors":"Xinxin Yu, Yuanting Chen, Long Chen, Weihua Li, Yuhao Wang, Yun Tang, Guixia Liu","doi":"10.1002/minf.202400169","DOIUrl":null,"url":null,"abstract":"<p><p>In silico methods for prediction of chemical toxicity can decrease the cost and increase the efficiency in the early stage of drug discovery. However, due to low accessibility of sufficient and reliable toxicity data, constructing robust and accurate prediction models is challenging. Contrastive learning, a type of self-supervised learning, leverages large unlabeled data to obtain more expressive molecular representations, which can boost the prediction performance on downstream tasks. While molecular graph contrastive learning has gathered growing attentions, current models neglect the quality of negative data set. Here, we proposed a self-supervised pretraining deep learning framework named GCLmf. We first utilized molecular fragments that meet specific conditions as hard negative samples to boost the quality of the negative set and thus increase the difficulty of the proxy tasks during pre-training to learn informative representations. GCLmf has shown excellent predictive power on various molecular property benchmarks and demonstrates high performance in 33 toxicity tasks in comparison with multiple baselines. In addition, we further investigated the necessity of introducing hard negatives in model building and the impact of the proportion of hard negatives on the model.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":null,"pages":null},"PeriodicalIF":2.8000,"publicationDate":"2024-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Molecular Informatics","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1002/minf.202400169","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"CHEMISTRY, MEDICINAL","Score":null,"Total":0}

引用次数: 0

Abstract

In silico methods for prediction of chemical toxicity can decrease the cost and increase the efficiency in the early stage of drug discovery. However, due to low accessibility of sufficient and reliable toxicity data, constructing robust and accurate prediction models is challenging. Contrastive learning, a type of self-supervised learning, leverages large unlabeled data to obtain more expressive molecular representations, which can boost the prediction performance on downstream tasks. While molecular graph contrastive learning has gathered growing attentions, current models neglect the quality of negative data set. Here, we proposed a self-supervised pretraining deep learning framework named GCLmf. We first utilized molecular fragments that meet specific conditions as hard negative samples to boost the quality of the negative set and thus increase the difficulty of the proxy tasks during pre-training to learn informative representations. GCLmf has shown excellent predictive power on various molecular property benchmarks and demonstrates high performance in 33 toxicity tasks in comparison with multiple baselines. In addition, we further investigated the necessity of introducing hard negatives in model building and the impact of the proportion of hard negatives on the model.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

GCLmf：基于硬阴性的新型分子图对比学习框架及其在毒性预测中的应用

在药物发现的早期阶段，预测化学毒性的硅学方法可以降低成本，提高效率。然而，由于难以获得充足可靠的毒性数据，构建稳健准确的预测模型具有挑战性。对比学习是一种自监督学习，它利用大量未标记数据来获得更具表现力的分子表征，从而提高下游任务的预测性能。虽然分子图对比学习受到越来越多的关注，但目前的模型忽视了负数据集的质量。在此，我们提出了一种名为 GCLmf 的自监督预训练深度学习框架。我们首先利用符合特定条件的分子片段作为硬负样本，以提高负集的质量，从而在预训练过程中增加代理任务的难度，以学习信息表征。GCLmf 在各种分子特性基准上都表现出了卓越的预测能力，与多个基线相比，它在 33 个毒性任务中表现出了很高的性能。此外，我们还进一步研究了在建立模型时引入硬阴性的必要性以及硬阴性比例对模型的影响。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Molecular Informatics CHEMISTRY, MEDICINAL-MATHEMATICAL & COMPUTATIONAL BIOLOGY

CiteScore

7.30

自引率

2.80%

发文量

审稿时长

3 months

期刊介绍： Molecular Informatics is a peer-reviewed, international forum for publication of high-quality, interdisciplinary research on all molecular aspects of bio/cheminformatics and computer-assisted molecular design. Molecular Informatics succeeded QSAR & Combinatorial Science in 2010. Molecular Informatics presents methodological innovations that will lead to a deeper understanding of ligand-receptor interactions, macromolecular complexes, molecular networks, design concepts and processes that demonstrate how ideas and design concepts lead to molecules with a desired structure or function, preferably including experimental validation. The journal''s scope includes but is not limited to the fields of drug discovery and chemical biology, protein and nucleic acid engineering and design, the design of nanomolecular structures, strategies for modeling of macromolecular assemblies, molecular networks and systems, pharmaco- and chemogenomics, computer-assisted screening strategies, as well as novel technologies for the de novo design of biologically active molecules. As a unique feature Molecular Informatics publishes so-called "Methods Corner" review-type articles which feature important technological concepts and advances within the scope of the journal.