评估耳鼻喉科领域不同大型语言模型的未知潜在质量和局限性。

IF 16.4 1区化学 Q1 CHEMISTRY, MULTIDISCIPLINARY Accounts of Chemical Research Pub Date : 2024-03-01 Epub Date: 2024-05-23 DOI:10.1080/00016489.2024.2352843

Christoph R Buhr, Harry Smith, Tilman Huppertz, Katharina Bahr-Hamm, Christoph Matthias, Clemens Cuny, Jan Phillipp Snijders, Benjamin Philipp Ernst, Andrew Blaikie, Tom Kelsey, Sebastian Kuhn, Jonas Eckrich

{"title":"评估耳鼻喉科领域不同大型语言模型的未知潜在质量和局限性。","authors":"Christoph R Buhr, Harry Smith, Tilman Huppertz, Katharina Bahr-Hamm, Christoph Matthias, Clemens Cuny, Jan Phillipp Snijders, Benjamin Philipp Ernst, Andrew Blaikie, Tom Kelsey, Sebastian Kuhn, Jonas Eckrich","doi":"10.1080/00016489.2024.2352843","DOIUrl":null,"url":null,"abstract":"Background: Large Language Models (LLMs) might offer a solution for the lack of trained health personnel, particularly in low- and middle-income countries. However, their strengths and weaknesses remain unclear.Aims/objectives: Here we benchmark different LLMs (Bard 2023.07.13, Claude 2, ChatGPT 4) against six consultants in otorhinolaryngology (ORL).Material and methods: Case-based questions were extracted from literature and German state examinations. Answers from Bard 2023.07.13, Claude 2, ChatGPT 4, and six ORL consultants were rated blindly on a 6-point Likert-scale for medical adequacy, comprehensibility, coherence, and conciseness. Given answers were compared to validated answers and evaluated for hazards. A modified Turing test was performed and character counts were compared.Results: LLMs answers ranked inferior to consultants in all categories. Yet, the difference between consultants and LLMs was marginal, with the clearest disparity in conciseness and the smallest in comprehensibility. Among LLMs Claude 2 was rated best in medical adequacy and conciseness. Consultants' answers matched the validated solution in 93% (228/246), ChatGPT 4 in 85% (35/41), Claude 2 in 78% (32/41), and Bard 2023.07.13 in 59% (24/41). Answers were rated as potentially hazardous in 10% (24/246) for ChatGPT 4, 14% (34/246) for Claude 2, 19% (46/264) for Bard 2023.07.13, and 6% (71/1230) for consultants.Conclusions and significance: Despite consultants superior performance, LLMs show potential for clinical application in ORL. Future studies should assess their performance on larger scale.","PeriodicalId":1,"journal":{"name":"Accounts of Chemical Research","volume":null,"pages":null},"PeriodicalIF":16.4000,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Assessing unknown potential-quality and limitations of different large language models in the field of otorhinolaryngology.\",\"authors\":\"Christoph R Buhr, Harry Smith, Tilman Huppertz, Katharina Bahr-Hamm, Christoph Matthias, Clemens Cuny, Jan Phillipp Snijders, Benjamin Philipp Ernst, Andrew Blaikie, Tom Kelsey, Sebastian Kuhn, Jonas Eckrich\",\"doi\":\"10.1080/00016489.2024.2352843\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Background: Large Language Models (LLMs) might offer a solution for the lack of trained health personnel, particularly in low- and middle-income countries. However, their strengths and weaknesses remain unclear.Aims/objectives: Here we benchmark different LLMs (Bard 2023.07.13, Claude 2, ChatGPT 4) against six consultants in otorhinolaryngology (ORL).Material and methods: Case-based questions were extracted from literature and German state examinations. Answers from Bard 2023.07.13, Claude 2, ChatGPT 4, and six ORL consultants were rated blindly on a 6-point Likert-scale for medical adequacy, comprehensibility, coherence, and conciseness. Given answers were compared to validated answers and evaluated for hazards. A modified Turing test was performed and character counts were compared.Results: LLMs answers ranked inferior to consultants in all categories. Yet, the difference between consultants and LLMs was marginal, with the clearest disparity in conciseness and the smallest in comprehensibility. Among LLMs Claude 2 was rated best in medical adequacy and conciseness. Consultants' answers matched the validated solution in 93% (228/246), ChatGPT 4 in 85% (35/41), Claude 2 in 78% (32/41), and Bard 2023.07.13 in 59% (24/41). Answers were rated as potentially hazardous in 10% (24/246) for ChatGPT 4, 14% (34/246) for Claude 2, 19% (46/264) for Bard 2023.07.13, and 6% (71/1230) for consultants.Conclusions and significance: Despite consultants superior performance, LLMs show potential for clinical application in ORL. Future studies should assess their performance on larger scale.\",\"PeriodicalId\":1,\"journal\":{\"name\":\"Accounts of Chemical Research\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":16.4000,\"publicationDate\":\"2024-03-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Accounts of Chemical Research\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1080/00016489.2024.2352843\",\"RegionNum\":1,\"RegionCategory\":\"化学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2024/5/23 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q1\",\"JCRName\":\"CHEMISTRY, MULTIDISCIPLINARY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Accounts of Chemical Research","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1080/00016489.2024.2352843","RegionNum":1,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/5/23 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"CHEMISTRY, MULTIDISCIPLINARY","Score":null,"Total":0}

引用次数: 0

摘要

背景：大型语言模型（LLMs）可以为缺乏训练有素的医疗人员提供解决方案，尤其是在中低收入国家。然而，它们的优缺点仍不明确：在此，我们以耳鼻喉科（ORL）的六名顾问为对象，对不同的语言模型（Bard 2023.07.13、Claude 2、ChatGPT 4）进行了基准测试：从文献和德国国家考试中提取了基于案例的问题。对 Bard 2023.07.13、Claude 2、ChatGPT 4 和六位耳鼻喉科顾问的答案进行了盲评，采用李克特（Likert）6 点量表，对医学充分性、可理解性、连贯性和简洁性进行评分。给出的答案与经过验证的答案进行了比较，并对危险性进行了评估。进行了修改后的图灵测试，并对字符数进行了比较：结果：在所有类别中，法律硕士的答案都不如顾问。然而，顾问和法律硕士之间的差距微乎其微，在简洁性方面差距最明显，而在可理解性方面差距最小。在法律硕士中，克劳德 2 在医学充分性和简洁性方面被评为最佳。顾问的答案有 93%（228/246）与验证方案相符，ChatGPT 4 有 85%（35/41），Claude 2 有 78%（32/41），Bard 2023.07.13 有 59%（24/41）。在 ChatGPT 4 中，10%（24/246）的答案被评为潜在危险；在 Claude 2 中，14%（34/246）的答案被评为潜在危险；在 Bard 2023.07.13 中，19%（46/264）的答案被评为潜在危险；在顾问中，6%（71/1230）的答案被评为潜在危险：尽管咨询师的性能更优越，但 LLM 在 ORL 的临床应用中仍有潜力。未来的研究应更大规模地评估其性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Assessing unknown potential-quality and limitations of different large language models in the field of otorhinolaryngology.

Background: Large Language Models (LLMs) might offer a solution for the lack of trained health personnel, particularly in low- and middle-income countries. However, their strengths and weaknesses remain unclear.

Aims/objectives: Here we benchmark different LLMs (Bard 2023.07.13, Claude 2, ChatGPT 4) against six consultants in otorhinolaryngology (ORL).

Material and methods: Case-based questions were extracted from literature and German state examinations. Answers from Bard 2023.07.13, Claude 2, ChatGPT 4, and six ORL consultants were rated blindly on a 6-point Likert-scale for medical adequacy, comprehensibility, coherence, and conciseness. Given answers were compared to validated answers and evaluated for hazards. A modified Turing test was performed and character counts were compared.

Results: LLMs answers ranked inferior to consultants in all categories. Yet, the difference between consultants and LLMs was marginal, with the clearest disparity in conciseness and the smallest in comprehensibility. Among LLMs Claude 2 was rated best in medical adequacy and conciseness. Consultants' answers matched the validated solution in 93% (228/246), ChatGPT 4 in 85% (35/41), Claude 2 in 78% (32/41), and Bard 2023.07.13 in 59% (24/41). Answers were rated as potentially hazardous in 10% (24/246) for ChatGPT 4, 14% (34/246) for Claude 2, 19% (46/264) for Bard 2023.07.13, and 6% (71/1230) for consultants.

Conclusions and significance: Despite consultants superior performance, LLMs show potential for clinical application in ORL. Future studies should assess their performance on larger scale.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Accounts of Chemical Research 化学-化学综合

CiteScore

31.40

自引率

1.10%

发文量

312

审稿时长

2 months

期刊介绍： Accounts of Chemical Research presents short, concise and critical articles offering easy-to-read overviews of basic research and applications in all areas of chemistry and biochemistry. These short reviews focus on research from the author’s own laboratory and are designed to teach the reader about a research project. In addition, Accounts of Chemical Research publishes commentaries that give an informed opinion on a current research problem. Special Issues online are devoted to a single topic of unusual activity and significance. Accounts of Chemical Research replaces the traditional article abstract with an article "Conspectus." These entries synopsize the research affording the reader a closer look at the content and significance of an article. Through this provision of a more detailed description of the article contents, the Conspectus enhances the article's discoverability by search engines and the exposure for the research.