Clinical application potential of large language model: a study based on thyroid nodules
Shujun Xia, Qing Hua, Zihan Mei, Wenwen Xu, Limei Lai, Minyan Wei, Yu Qin, Lin Luo, Changhua Wang, ShengNan Huo, Lijun Fu, Feidu Zhou, Jiang Wu, Li Zhang, De Lv, Jianxin Li, Xin Wang, Ning Li, Yanyan Song, Jianqiao Zhou
Endocrine, pp. 206-213. Published 2025-01-01 (Epub 2024-07-30). DOI: https://doi.org/10.1007/s12020-024-03981-3
Abstract
Background: Limited data are available on the performance of large language models (LLMs) taking on the role of doctors. We aimed to investigate the potential of ChatGPT-3.5 and New Bing Chat to act as doctors, using thyroid nodules as an example.
Methods: A total of 145 patients with thyroid nodules were included for generating questions. Each question was entered into the ChatGPT-3.5 and New Bing Chat chatbots five times, yielding five responses from each. These responses were compared with answers given by five junior doctors. Responses from five senior doctors were regarded as the gold standard. The accuracy and reproducibility of responses from ChatGPT-3.5 and New Bing Chat were evaluated.
Results: The accuracy of ChatGPT-3.5 and New Bing Chat in answering Q2, Q3, and Q5 was lower than that of junior doctors (all P < 0.05), while both LLMs were comparable to junior doctors when answering Q4 and Q6. In terms of "high reproducibility and accuracy", ChatGPT-3.5 outperformed New Bing Chat in Q1 and Q5 (P < 0.001 and P = 0.008, respectively), but the two showed no significant difference in Q2, Q3, Q4, and Q6 (P > 0.05 for all). In decision making for thyroid nodules, New Bing Chat was more accurate than ChatGPT-3.5 (72.41% vs 58.62%, P = 0.003), and both were less accurate than junior doctors (89.66%, P < 0.001 for both).
Conclusions: The exploration of ChatGPT-3.5 and New Bing Chat in the diagnosis and management of thyroid nodules illustrates that LLMs currently demonstrate the potential for medical applications, but do not yet reach the clinical decision-making capacity of doctors.
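As a rough reading aid for the decision-making accuracies in the Results, the sketch below converts each reported percentage into the count of correct decisions it would imply. It assumes each percentage is a share of the 145 included patients; the abstract does not state the denominator, so the counts are inferred rather than reported, and the names `N_PATIENTS`, `reported_pct`, and `implied_correct` are illustrative only.

```python
# Assumption (not stated in the abstract): each decision-making accuracy
# is a share of the 145 included patients.
N_PATIENTS = 145

# Percentages as reported in the Results.
reported_pct = {
    "New Bing Chat": 72.41,
    "ChatGPT-3.5": 58.62,
    "Junior doctors": 89.66,
}


def implied_correct(pct: float, n: int = N_PATIENTS) -> int:
    """Nearest whole number of correct decisions implied by a percentage of n."""
    return round(pct / 100 * n)


for rater, pct in reported_pct.items():
    k = implied_correct(pct)
    # Recompute the percentage from the inferred count to check the round trip.
    print(f"{rater}: {k}/{N_PATIENTS} = {100 * k / N_PATIENTS:.2f}%")
```

Under this assumption the three figures correspond to 105, 85, and 130 correct decisions out of 145. The round trip only shows that the percentages are consistent with that denominator; other denominators, such as responses pooled across the five junior doctors, could produce the same values.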
About the journal:
Well-established as a major journal in today’s rapidly advancing experimental and clinical research areas, Endocrine publishes original articles devoted to basic (including molecular, cellular and physiological studies), translational and clinical research in all the different fields of endocrinology and metabolism. Articles will be accepted based on peer-reviews, priority, and editorial decision. Invited reviews, mini-reviews and viewpoints on relevant pathophysiological and clinical topics, as well as Editorials on articles appearing in the Journal, are published. Unsolicited Editorials will be evaluated by the editorial team. Outcomes of scientific meetings, as well as guidelines and position statements, may be submitted. The Journal also considers special feature articles in the field of endocrine genetics and epigenetics, as well as articles devoted to novel methods and techniques in endocrinology.
Endocrine covers controversial clinical endocrine issues. Meta-analyses on endocrine and metabolic topics are also accepted. Descriptions of single clinical cases and/or small patient studies are not published unless of exceptional interest. However, reports of novel imaging studies and endocrine side effects in single patients may be considered. Research letters and letters to the editor, whether or not related to recently published articles, can be submitted.
Endocrine covers leading topics in endocrinology such as neuroendocrinology, pituitary and hypothalamic peptides, thyroid physiological and clinical aspects, bone and mineral metabolism and osteoporosis, obesity, lipid and energy metabolism and food intake control, insulin, Type 1 and Type 2 diabetes, hormones of male and female reproduction, adrenal diseases, pediatric and geriatric endocrinology, endocrine hypertension and endocrine oncology.