Improving drug repositioning with negative data labeling using large language models

Journal of Cheminformatics | IF 5.7 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2025-02-04 | DOI: 10.1186/s13321-025-00962-0

Milan Picard, Mickael Leclercq, Antoine Bodein, Marie Pier Scott-Boyer, Olivier Perin, Arnaud Droit


Citations: 0

Abstract

Introduction

Drug repositioning offers numerous advantages, such as faster development timelines, reduced costs, and lower failure rates in drug development. Supervised machine learning is commonly used to score drug candidates, but it is hindered by the lack of reliable negative data (drugs that failed due to inefficacy or toxicity), which are difficult to access. This lowers the prediction accuracy and generalization of the resulting models. Positive-Unlabeled (PU) learning has been used to overcome this issue, either by randomly sampling unlabeled drugs as negatives or by identifying probable negatives, but it still suffers from misclassification or oversimplified decision boundaries.
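The random-sampling variant of PU learning mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the drug identifiers, the `ratio` parameter, and the `hidden_actives` set are hypothetical and exist only to show how an active drug hidden in the unlabeled pool can end up labeled as a negative.

```python
import random


def sample_pseudo_negatives(unlabeled, n_negatives, seed=0):
    """Randomly pick unlabeled drugs to be treated as negatives (PU baseline)."""
    rng = random.Random(seed)
    return rng.sample(unlabeled, n_negatives)


def build_training_set(positives, unlabeled, ratio=2, seed=0):
    """Return (drug, label) pairs: 1 for known positives, 0 for sampled pseudo-negatives."""
    n_neg = min(len(unlabeled), ratio * len(positives))
    negatives = sample_pseudo_negatives(unlabeled, n_neg, seed)
    return [(d, 1) for d in positives] + [(d, 0) for d in negatives]


# Hypothetical identifiers, for illustration only.
positives = ["drug_P1", "drug_P2", "drug_P3"]
unlabeled = [f"drug_U{i}" for i in range(20)]
# Hypothetical: drugs that are actually effective but not yet known to be.
hidden_actives = {"drug_U4", "drug_U11"}

training_set = build_training_set(positives, unlabeled, ratio=2, seed=0)

# Any hidden active that was sampled is now a mislabeled negative.
mislabeled = [d for d, y in training_set if y == 0 and d in hidden_actives]
print(len(training_set), "training examples;", len(mislabeled), "mislabeled negatives")
```

Whether a given draw contains mislabeled negatives depends on the seed, which is exactly the weakness: the sampler cannot tell a failed drug from an untested one.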

Results

We proposed a novel strategy using Large Language Models (GPT-4) to analyze all clinical trials on prostate cancer and systematically identify true negatives. This approach improved predictive accuracy on independent test sets, reaching a Matthews Correlation Coefficient of 0.76 (± 0.33) compared to 0.55 (± 0.15) and 0.48 (± 0.18) for two commonly used PU learning approaches. Using our labeling strategy, we created a training set of 26 positive and 54 experimentally validated negative drugs. We then applied a machine learning ensemble to this new dataset to assess the repurposing potential of the remaining 11,043 drugs in the DrugBank database. This analysis identified 980 potential candidates for prostate cancer. A detailed review of the top 30 revealed 9 promising drugs targeting various mechanisms such as genomic instability, p53 regulation, or TMPRSS2-ERG fusion.
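For readers unfamiliar with the metric reported above, the Matthews Correlation Coefficient is computed from the four confusion-matrix counts as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), ranging from -1 to 1. The sketch below is a plain-Python version of that formula; the counts in the example are hypothetical and are not taken from the paper, and returning 0.0 for a zero denominator is a common convention rather than something the abstract specifies.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical test-set counts, not taken from the paper.
score = matthews_corrcoef(tp=5, tn=10, fp=1, fn=1)
print(round(score, 2))  # 49 / 66, prints 0.74
```

Unlike plain accuracy, the score only approaches 1 when both positives and negatives are predicted well, which matters for a training set as imbalanced as 26 positives to 54 negatives.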

Conclusion

Expanded to all diseases within the ClinicalTrials.gov database, our negative data labeling approach could greatly advance supervised drug repositioning, offering a more accurate and data-driven path for discovering new treatments.


In the context of drug repositioning, machine learning algorithms trained on true negative data are consistently more accurate and generalize better than commonly used PU learning approaches. Large language models can enhance machine learning datasets by extracting relevant biological and medical data from online databases.


Source journal

Journal of Cheminformatics (CHEMISTRY, MULTIDISCIPLINARY; COMPUTER SCIENCE, INFORMATION SYSTEMS)

CiteScore: 14.10

Self-citation rate: 7.00%

Review time: 3 months

Journal description: Journal of Cheminformatics is an open access journal publishing original peer-reviewed research in all aspects of cheminformatics and molecular modelling. Coverage includes, but is not limited to: chemical information systems, software and databases; molecular modelling; chemical structure representations and their use in structure, substructure, and similarity searching of chemical substance and chemical reaction databases; computer and molecular graphics; computer-aided molecular design; expert systems; QSAR; and data mining techniques.