Evaluating the effectiveness of large language models in patient education for conjunctivitis

British Journal of Ophthalmology (IF 3.7; Zone 2, Medicine; Q1, Ophthalmology). Pub Date: 2024-08-30. DOI: 10.1136/bjo-2024-325599

Jingyuan Wang, Runhan Shi, Qihua Le, Kun Shan, Zhi Chen, Xujiao Zhou, Yao He, Jiaxu Hong


Citations: 0

Abstract

Aims: To evaluate the quality of responses from large language models (LLMs) to patient-generated conjunctivitis questions.

Methods: A two-phase, cross-sectional study was conducted at the Eye and ENT Hospital of Fudan University. In phase 1, four LLMs (GPT-4, Qwen, Baichuan 2 and PaLM 2) responded to 22 frequently asked conjunctivitis questions. Six expert ophthalmologists assessed these responses using a 5-point Likert scale for correctness, completeness, readability, helpfulness and safety, supplemented by objective readability analysis. Phase 2 involved 30 conjunctivitis patients who interacted with GPT-4 or Qwen, evaluating the LLM-generated responses based on satisfaction, humanisation, professionalism and the same dimensions except for correctness from phase 1. Three ophthalmologists assessed responses using phase 1 criteria, allowing for a comparative analysis between medical and patient evaluations, probing the study’s practical significance.

Results: In phase 1, GPT-4 excelled across all metrics, particularly in correctness (4.39±0.76), completeness (4.31±0.96) and readability (4.65±0.59) while Qwen showed similarly strong performance in helpfulness (4.37±0.93) and safety (4.25±1.03). Baichuan 2 and PaLM 2 were effective but trailed behind GPT-4 and Qwen. The objective readability analysis revealed GPT-4’s responses as the most detailed, with PaLM 2’s being the most succinct. Phase 2 demonstrated GPT-4 and Qwen’s robust performance, with high satisfaction levels and consistent evaluations from both patients and professionals.

Conclusions: Our study showed LLMs effectively improve patient education in conjunctivitis. These models showed considerable promise in real-world patient interactions. Despite encouraging results, further refinement, particularly in personalisation and handling complex inquiries, is essential prior to the clinical integration of these LLMs.

All data relevant to the study are included in the article or uploaded as online supplemental information.
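
The abstract reports ratings in the form 4.39±0.76 but does not say what the ± denotes. As a minimal sketch, assuming it is the mean ± sample standard deviation of 5-point Likert ratings pooled over raters and questions, such a summary could be computed as follows. The model names, the ratings and the helper functions `summarize` and `format_summary` are hypothetical illustrations, not data or code from the study.

```python
from statistics import mean, stdev

# Hypothetical 5-point Likert ratings (NOT data from the study):
# one list per model, pooled over raters and questions.
ratings = {
    "model_a": [5, 4, 5, 4, 3, 5],
    "model_b": [4, 4, 3, 5, 4, 4],
}


def summarize(scores):
    """Return (mean, sample standard deviation) of a list of Likert scores."""
    if any(s not in (1, 2, 3, 4, 5) for s in scores):
        raise ValueError("Likert scores must be integers from 1 to 5")
    return mean(scores), stdev(scores)


def format_summary(scores):
    """Format a score list as 'mean±SD' with two decimals."""
    m, sd = summarize(scores)
    return f"{m:.2f}±{sd:.2f}"


for name, scores in ratings.items():
    print(name, format_summary(scores))
```

Whether the authors pooled ratings this way, or averaged per rater or per question first, is not stated in the abstract; the choice changes the reported spread, so the sketch should be read as one possible convention only.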


Source journal

British Journal of Ophthalmology (Medicine: Ophthalmology)

CiteScore: 10.30

Self-citation rate: 2.40%

Articles published: 213

Review time: 3-6 weeks

Journal description: The British Journal of Ophthalmology (BJO) is an international peer-reviewed journal for ophthalmologists and visual science specialists. BJO publishes clinical investigations, clinical observations, and clinically relevant laboratory investigations related to ophthalmology. It also provides major reviews and publishes manuscripts covering regional issues in a global context.