Shujun Xia, Qing Hua, Zihan Mei, Wenwen Xu, Limei Lai, Minyan Wei, Yu Qin, Lin Luo, Changhua Wang, ShengNan Huo, Lijun Fu, Feidu Zhou, Jiang Wu, Li Zhang, De Lv, Jianxin Li, Xin Wang, Ning Li, Yanyan Song, Jianqiao Zhou
Endocrine, pp. 206-213. Journal Article. Published 2025-01-01 (Epub 2024-07-30). DOI: 10.1007/s12020-024-03981-3 (https://doi.org/10.1007/s12020-024-03981-3)
Clinical application potential of large language model: a study based on thyroid nodules.
Background: Few data are available on how large language models (LLMs) perform when taking on the role of doctors. We aimed to investigate the potential of ChatGPT-3.5 and New Bing Chat to act as doctors, using thyroid nodules as an example.
Methods: A total of 145 patients with thyroid nodules were included for generating questions. Each question was entered into the ChatGPT-3.5 and New Bing Chat chatbots five times, yielding five responses from each. These responses were compared with answers given by five junior doctors. Responses from five senior doctors were regarded as the gold standard. The accuracy and reproducibility of the responses from ChatGPT-3.5 and New Bing Chat were evaluated.
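One way to picture the repeated-query design is the minimal Python sketch below, which scores the five responses to a single question against a gold-standard answer. The abstract does not define how accuracy or reproducibility was computed, so the two definitions used here, the function name `score_responses`, and the example labels are all hypothetical and serve only as an illustration.

```python
from collections import Counter


def score_responses(responses, gold):
    """Score one question's repeated chatbot responses against a gold answer.

    Hypothetical definitions (NOT taken from the paper):
      accuracy        = share of responses equal to the gold answer
      reproducibility = share of responses equal to the most common response
    """
    if not responses:
        raise ValueError("need at least one response")
    n = len(responses)
    accuracy = sum(r == gold for r in responses) / n
    # Size of the largest group of identical responses.
    modal_count = Counter(responses).most_common(1)[0][1]
    reproducibility = modal_count / n
    return accuracy, reproducibility


# Made-up labels for one question asked five times (illustration only).
example = ["benign", "benign", "malignant", "benign", "benign"]
accuracy, reproducibility = score_responses(example, gold="benign")
print(accuracy, reproducibility)
```

Under these assumed definitions a chatbot can be fully reproducible yet wrong (five identical incorrect answers), which is one reason to report the two measures together.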
Results: The accuracy of ChatGPT-3.5 and New Bing Chat in answering Q2, Q3, and Q5 was lower than that of junior doctors (all P < 0.05), while both LLMs were comparable to junior doctors when answering Q4 and Q6. In terms of "high reproducibility and accuracy", ChatGPT-3.5 outperformed New Bing Chat in Q1 and Q5 (P < 0.001 and P = 0.008, respectively), but the two showed no significant difference in Q2, Q3, Q4, and Q6 (P > 0.05 for all). In decision making for thyroid nodules, New Bing Chat was more accurate than ChatGPT-3.5 (72.41% vs 58.62%, P = 0.003), and both were less accurate than junior doctors (89.66%, P < 0.001 for both).
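As a quick arithmetic check, the reported decision-making accuracies can be converted back into case counts. The sketch below assumes the denominator is the 145 patients named in the Methods; the abstract does not state this explicitly, and the helper name `implied_count` is hypothetical. It does not attempt to reproduce the P values, because the abstract does not name the statistical test.

```python
N_PATIENTS = 145  # from the Methods; ASSUMED to be the denominator

# Decision-making accuracies as reported in the Results (percent).
reported = {
    "New Bing Chat": 72.41,
    "ChatGPT-3.5": 58.62,
    "Junior doctors": 89.66,
}


def implied_count(percent, n=N_PATIENTS):
    """Return the whole number of correct cases closest to percent of n."""
    return round(percent / 100 * n)


for name, pct in reported.items():
    k = implied_count(pct)
    # Recompute the percentage from the integer count to compare with the report.
    print(f"{name}: {k}/{N_PATIENTS} = {100 * k / N_PATIENTS:.2f}%")
```

Under this assumption the percentages correspond to 105, 85, and 130 correct decisions out of 145, and each recomputed percentage matches the reported value to two decimal places.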
Conclusions: This exploration of ChatGPT-3.5 and New Bing Chat in the diagnosis and management of thyroid nodules shows that LLMs currently have potential for medical applications but do not yet reach the clinical decision-making capacity of doctors.
About the journal:
Well-established as a major journal in today’s rapidly advancing experimental and clinical research areas, Endocrine publishes original articles devoted to basic (including molecular, cellular and physiological studies), translational and clinical research in all the different fields of endocrinology and metabolism. Articles will be accepted based on peer review, priority, and editorial decision. Invited reviews, mini-reviews and viewpoints on relevant pathophysiological and clinical topics, as well as Editorials on articles appearing in the Journal, are published. Unsolicited Editorials will be evaluated by the editorial team. Outcomes of scientific meetings, as well as guidelines and position statements, may be submitted. The Journal also considers special feature articles in the field of endocrine genetics and epigenetics, as well as articles devoted to novel methods and techniques in endocrinology.
Endocrine covers controversial clinical endocrine issues. Meta-analyses on endocrine and metabolic topics are also accepted. Descriptions of single clinical cases and/or small patient studies are not published unless of exceptional interest. However, reports of novel imaging studies and endocrine side effects in single patients may be considered. Research letters and letters to the editor, whether related or unrelated to recently published articles, can be submitted.
Endocrine covers leading topics in endocrinology such as neuroendocrinology, pituitary and hypothalamic peptides, thyroid physiological and clinical aspects, bone and mineral metabolism and osteoporosis, obesity, lipid and energy metabolism and food intake control, insulin, Type 1 and Type 2 diabetes, hormones of male and female reproduction, adrenal diseases, pediatric and geriatric endocrinology, endocrine hypertension and endocrine oncology.