ChatGPT-3.5 and -4.0 Do Not Reliably Create Readable Patient Education Materials for Common Orthopaedic Upper- and Lower-Extremity Conditions

Arthroscopy, Sports Medicine, and Rehabilitation (JCR Q3, Medicine). Pub Date: 2025-02-01. Epub Date: 2024-10-10. DOI: 10.1016/j.asmr.2024.101027

Ryan S. Marder M.D., George Abdelmalek M.D., Sean M. Richards B.A., Nicolas J. Nadeau B.S., Daniel J. Garcia B.S., Peter J. Attia B.A., Gavin Rallis M.D., Anthony J. Scillia M.D.

Volume 7, Issue 1, Article 101027. Full text: https://www.sciencedirect.com/science/article/pii/S2666061X24001706


Abstract

Purpose

To investigate whether ChatGPT-3.5 and -4.0 can serve as a viable tool to create readable patient education materials for patients with common orthopaedic upper- and lower-extremity conditions.

Methods

Using ChatGPT versions 3.5 and 4.0, we asked the artificial intelligence program a series of 2 questions pertaining to patient education for 50 common orthopaedic upper-extremity pathologies and 50 common orthopaedic lower-extremity pathologies. Two templated questions were created and used for all conditions. Readability scores were calculated using the Python library Textstat. Multiple readability test scores were generated, and a consensus reading level was created taking into account the results of 8 reading tests.
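The scoring step described above can be illustrated with a minimal sketch. It does not use Textstat and is not the authors' code: it re-implements three of the published formulas (Flesch-Kincaid grade, Automated Readability Index, Coleman-Liau index) with the standard library only, and the syllable counter is a rough heuristic. Taking the median of the three grades as a "consensus" is an assumption made for illustration; the abstract does not say how the 8 test results were combined.

```python
import re
from statistics import median


def count_syllables(word: str) -> int:
    """Rough vowel-group heuristic (an approximation, not a dictionary lookup)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing "e" as silent ("made"), but keep "-le" endings ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def text_counts(text: str) -> tuple[int, int, int, int]:
    """Return (sentences, words, letters, syllables) for plain English prose.

    Sentences are split on ., ! and ?, so decimals such as "3.5" would be
    miscounted; this is acceptable for a sketch.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    letters = sum(len(w.replace("'", "")) for w in words)
    syllables = sum(count_syllables(w) for w in words)
    return len(sentences), len(words), letters, syllables


def flesch_kincaid_grade(text: str) -> float:
    s, w, _, syl = text_counts(text)
    return 0.39 * (w / s) + 11.8 * (syl / w) - 15.59


def automated_readability_index(text: str) -> float:
    s, w, letters, _ = text_counts(text)
    return 4.71 * (letters / w) + 0.5 * (w / s) - 21.43


def coleman_liau_index(text: str) -> float:
    s, w, letters, _ = text_counts(text)
    letters_per_100 = letters / w * 100
    sentences_per_100 = s / w * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


def consensus_grade(text: str) -> float:
    """Median of the three grade estimates (illustrative assumption)."""
    return median([
        flesch_kincaid_grade(text),
        automated_readability_index(text),
        coleman_liau_index(text),
    ])


simple = "The knee is a joint. It helps you walk. Rest it if it hurts."
dense = ("Osteoarthritis is a degenerative condition characterized by "
         "progressive deterioration of articular cartilage, frequently "
         "necessitating pharmacological intervention.")
print(round(consensus_grade(simple), 1), round(consensus_grade(dense), 1))
```

Short sentences with short words score at a much lower grade than the dense clinical sentence, which is the property the study relies on when judging whether a response is readable.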

Results

ChatGPT-3.5 produced only 2% and 4% of responses at the appropriate reading level for upper- and lower-extremity conditions, respectively, compared with 54% produced by ChatGPT-4.0 for both upper- and lower-extremity conditions (both P < .0001). After a priming phase, ChatGPT-3.5 did not produce any viable responses for either the upper- or lower-extremity conditions, compared with 64% for both upper- and lower-extremity conditions by ChatGPT-4.0 (both P < .0001). Additionally, ChatGPT-4.0 was more successful than ChatGPT-3.5 in producing viable responses both before and after a priming phase on every reading-level metric assessed (all P < .001): the Automated Readability Index, Coleman-Liau Index, Dale-Chall formula, Flesch-Kincaid grade, Flesch Reading Ease score, Gunning Fog score, Linsear Write Formula score, and Simple Measure of Gobbledygook (SMOG) index.
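To give a sense of how far apart proportions such as 2% and 54% are, the sketch below runs a two-sided Fisher exact test on a 2x2 table using only the standard library. Two things here are assumptions rather than facts from the abstract: the abstract does not name the statistical test the authors used, and the counts (1 of 50 versus 27 of 50) assume one scored response per condition in each 50-condition group.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )


# Assumed counts: rows are ChatGPT-3.5 and ChatGPT-4.0,
# columns are (readable, not readable) responses out of 50 each.
p_value = fisher_exact_two_sided(1, 49, 27, 23)
print(f"p = {p_value:.2e}")
```

Under these assumed counts the p-value falls well below the .0001 threshold reported in the Results, though the exact figure depends on the true denominators and on the test the authors applied.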

Conclusions

Our results indicate that, at the time of the study, neither ChatGPT-3.5 nor ChatGPT-4.0 reliably created readable patient education materials for common orthopaedic upper- and lower-extremity conditions.

Clinical Relevance

The findings of this study suggest that ChatGPT, while constantly improving as evidenced by the advances from version 3.5 to version 4.0, should not be substituted for traditional methods of patient education at this time and, in its current state, may be used as a supplemental resource at the discretion of providers.
