Kendall A Flaharty, Ping Hu, Suzanna Ledgister Hanchard, Molly E Ripper, Dat Duong, Rebekah L Waikel, Benjamin D Solomon
{"title":"在医学、非专业语言和自我报告的遗传条件描述上评估大型语言模型。","authors":"Kendall A Flaharty, Ping Hu, Suzanna Ledgister Hanchard, Molly E Ripper, Dat Duong, Rebekah L Waikel, Benjamin D Solomon","doi":"10.1016/j.ajhg.2024.07.011","DOIUrl":null,"url":null,"abstract":"<p><p>Large language models (LLMs) are generating interest in medical settings. For example, LLMs can respond coherently to medical queries by providing plausible differential diagnoses based on clinical notes. However, there are many questions to explore, such as evaluating differences between open- and closed-source LLMs as well as LLM performance on queries from both medical and non-medical users. In this study, we assessed multiple LLMs, including Llama-2-chat, Vicuna, Medllama2, Bard/Gemini, Claude, ChatGPT3.5, and ChatGPT-4, as well as non-LLM approaches (Google search and Phenomizer) regarding their ability to identify genetic conditions from textbook-like clinician questions and their corresponding layperson translations related to 63 genetic conditions. For open-source LLMs, larger models were more accurate than smaller LLMs: 7b, 13b, and larger than 33b parameter models obtained accuracy ranges from 21%-49%, 41%-51%, and 54%-68%, respectively. Closed-source LLMs outperformed open-source LLMs, with ChatGPT-4 performing best (89%-90%). Three of 11 LLMs and Google search had significant performance gaps between clinician and layperson prompts. We also evaluated how in-context prompting and keyword removal affected open-source LLM performance. Models were provided with 2 types of in-context prompts: list-type prompts, which improved LLM performance, and definition-type prompts, which did not. We further analyzed removal of rare terms from descriptions, which decreased accuracy for 5 of 7 evaluated LLMs. Finally, we observed much lower performance with real individuals' descriptions; LLMs answered these questions with a maximum 21% accuracy.</p>","PeriodicalId":7659,"journal":{"name":"American journal of human genetics","volume":null,"pages":null},"PeriodicalIF":8.1000,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11393706/pdf/","citationCount":"0","resultStr":"{\"title\":\"Evaluating large language models on medical, lay-language, and self-reported descriptions of genetic conditions.\",\"authors\":\"Kendall A Flaharty, Ping Hu, Suzanna Ledgister Hanchard, Molly E Ripper, Dat Duong, Rebekah L Waikel, Benjamin D Solomon\",\"doi\":\"10.1016/j.ajhg.2024.07.011\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Large language models (LLMs) are generating interest in medical settings. For example, LLMs can respond coherently to medical queries by providing plausible differential diagnoses based on clinical notes. However, there are many questions to explore, such as evaluating differences between open- and closed-source LLMs as well as LLM performance on queries from both medical and non-medical users. In this study, we assessed multiple LLMs, including Llama-2-chat, Vicuna, Medllama2, Bard/Gemini, Claude, ChatGPT3.5, and ChatGPT-4, as well as non-LLM approaches (Google search and Phenomizer) regarding their ability to identify genetic conditions from textbook-like clinician questions and their corresponding layperson translations related to 63 genetic conditions. For open-source LLMs, larger models were more accurate than smaller LLMs: 7b, 13b, and larger than 33b parameter models obtained accuracy ranges from 21%-49%, 41%-51%, and 54%-68%, respectively. Closed-source LLMs outperformed open-source LLMs, with ChatGPT-4 performing best (89%-90%). Three of 11 LLMs and Google search had significant performance gaps between clinician and layperson prompts. We also evaluated how in-context prompting and keyword removal affected open-source LLM performance. Models were provided with 2 types of in-context prompts: list-type prompts, which improved LLM performance, and definition-type prompts, which did not. We further analyzed removal of rare terms from descriptions, which decreased accuracy for 5 of 7 evaluated LLMs. Finally, we observed much lower performance with real individuals' descriptions; LLMs answered these questions with a maximum 21% accuracy.</p>\",\"PeriodicalId\":7659,\"journal\":{\"name\":\"American journal of human genetics\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":8.1000,\"publicationDate\":\"2024-09-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11393706/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"American journal of human genetics\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://doi.org/10.1016/j.ajhg.2024.07.011\",\"RegionNum\":1,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2024/8/14 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q1\",\"JCRName\":\"GENETICS & HEREDITY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"American journal of human genetics","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1016/j.ajhg.2024.07.011","RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/8/14 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"GENETICS & HEREDITY","Score":null,"Total":0}
Evaluating large language models on medical, lay-language, and self-reported descriptions of genetic conditions.
Large language models (LLMs) are generating interest in medical settings. For example, LLMs can respond coherently to medical queries by providing plausible differential diagnoses based on clinical notes. However, there are many questions to explore, such as evaluating differences between open- and closed-source LLMs as well as LLM performance on queries from both medical and non-medical users. In this study, we assessed multiple LLMs, including Llama-2-chat, Vicuna, Medllama2, Bard/Gemini, Claude, ChatGPT3.5, and ChatGPT-4, as well as non-LLM approaches (Google search and Phenomizer) regarding their ability to identify genetic conditions from textbook-like clinician questions and their corresponding layperson translations related to 63 genetic conditions. For open-source LLMs, larger models were more accurate than smaller LLMs: 7b, 13b, and larger than 33b parameter models obtained accuracy ranges from 21%-49%, 41%-51%, and 54%-68%, respectively. Closed-source LLMs outperformed open-source LLMs, with ChatGPT-4 performing best (89%-90%). Three of 11 LLMs and Google search had significant performance gaps between clinician and layperson prompts. We also evaluated how in-context prompting and keyword removal affected open-source LLM performance. Models were provided with 2 types of in-context prompts: list-type prompts, which improved LLM performance, and definition-type prompts, which did not. We further analyzed removal of rare terms from descriptions, which decreased accuracy for 5 of 7 evaluated LLMs. Finally, we observed much lower performance with real individuals' descriptions; LLMs answered these questions with a maximum 21% accuracy.
期刊介绍:
The American Journal of Human Genetics (AJHG) is a monthly journal published by Cell Press, chosen by The American Society of Human Genetics (ASHG) as its premier publication starting from January 2008. AJHG represents Cell Press's first society-owned journal, and both ASHG and Cell Press anticipate significant synergies between AJHG content and that of other Cell Press titles.