TESTING THE ACCURACY OF MODERN LLMS IN ANSWERING GENERAL MEDICAL PROMPTS
Authors: Sahil Narula, Sanaa Karkera, Rushil Challa, Sarina Virmani, Nithya Chilukuri, Mason Elkas, Nidhi Thammineni, Ankita Kamath, Parth Jaiswal, Abhishek Krishnan
Journal: International Journal of Social Science and Economic Research
Publication date: 2023-01-01
Publication type: Journal Article
DOI: 10.46609/ijsser.2023.v08i09.021 (https://doi.org/10.46609/ijsser.2023.v08i09.021)
Abstract
The rising use of large language models (LLMs) for answering medical questions necessitates an evaluation of their accuracy, especially given the implications for public health. This study employed a comprehensive test suite of 500 medical prompts, with the responses evaluated by a panel of medical experts for factual accuracy, contextual relevance, and potential risk. The responses from state-of-the-art LLMs were also compared with answers from a control group of medical students. Results indicated a high level of accuracy among LLMs, with a median score of 88%. While LLMs performed well on general wellness questions (92% accuracy), they were less reliable for specialized medical queries (80% accuracy). The control group of medical students outperformed LLMs in answering specialized medical questions. In conclusion, while LLMs demonstrate a high degree of factual accuracy for general medical information, they are less reliable for specialized or complex health-related queries. Given their widespread use, LLMs could be a preliminary source for general medical advice, but their limitations underscore the need to consult experts for specialized medical conditions. Future work should focus on enhancing the models' capabilities in specialized domains and evaluating the ethical implications of using LLMs for medical information dissemination. This study serves as a baseline for the responsible use of AI in healthcare.