Is ChatGPT ready for public use in organ-specific drug toxicity research?
Skylar Connor, Leihong Wu, Ruth A Roberts, Weida Tong
Drug Discovery Today, article 104297. Published 2025-01-20. DOI: 10.1016/j.drudis.2025.104297 (https://doi.org/10.1016/j.drudis.2025.104297)
Impact factor: 6.5. JCR: Q1 (Pharmacology & Pharmacy).
Citations: 0
Abstract
The growing impact of large language models (LLMs), such as ChatGPT, prompts questions about the reliability of their application in public health. We compared drug toxicity assessments by GPT-4 for liver, heart, and kidney against expert assessments using US Food and Drug Administration (FDA) drug-labeling documents. Two approaches were assessed: a 'General prompt', mimicking the conversational style used by the general public, and an 'Expert prompt', engineered to represent an expert's approach. The Expert prompt achieved higher accuracy (64-75%) than the General prompt (48-72%), but the overall performance was moderate, indicating that caution is needed when using GPT-4 for public health. To improve reliability, an advanced framework, such as Retrieval Augmented Generation (RAG), might be required to leverage knowledge embedded in GPT-4.
About the journal:
Drug Discovery Today delivers informed and highly current reviews for the discovery community. The magazine addresses not only the rapid scientific developments in drug discovery associated technologies but also the management, commercial and regulatory issues that increasingly play a part in how R&D is planned, structured and executed.
Features include comment by international experts, news and analysis of important developments, reviews of key scientific and strategic issues, overviews of recent progress in specific therapeutic areas and conference reports.