Effectiveness of a large language model for clinical information retrieval regarding shoulder arthroplasty

Journal of Experimental Orthopaedics (IF 2.0, Q2 Orthopedics) · Published 2024-12-17 · DOI: 10.1002/jeo2.70114
Jacob F. Oeding, Amy Z. Lu, Michael Mazzucco, Michael C. Fu, David M. Dines, Russell F. Warren, Lawrence V. Gulotta, Joshua S. Dines, Kyle N. Kunze

Abstract

Purpose

To determine the scope and accuracy of medical information provided by ChatGPT-4 in response to clinical queries concerning total shoulder arthroplasty (TSA), and to compare these results to those of the Google search engine.

Methods

A query replicating a typical patient search for 'total shoulder replacement' was performed using both Google Web Search (the most frequently used search engine worldwide) and ChatGPT-4. The top 10 frequently asked questions (FAQs), their answers, and the associated sources were extracted. The search was then repeated independently to identify the top 10 FAQs requiring numerical responses, so that the concordance of answers could be compared between Google and ChatGPT-4. Two blinded orthopaedic shoulder surgeons graded the clinical relevance and accuracy of the provided information.

Results

Among FAQs with numeric responses, 8 out of 10 (80%) had identical answers or substantial overlap between ChatGPT-4 and Google. Accuracy of information was not significantly different (p = 0.32). Google sources comprised 40% medical practices, 30% academic, 20% single-surgeon practices, and 10% social media, whereas ChatGPT-4 drew on 100% academic sources, a statistically significant difference (p = 0.001). Only 3 out of 10 (30%) FAQs with open-ended answers were identical between ChatGPT-4 and Google. The clinical relevance of FAQs was not significantly different (p = 0.18). Google sources for open-ended questions comprised academic (60%), social media (20%), medical practice (10%), and single-surgeon practice (10%) sources, whereas 100% of ChatGPT-4 sources were academic, a statistically significant difference (p = 0.0025).
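The abstract does not state which statistical test produced the reported p-values. As an illustration only, the source-category percentages for the numeric-response FAQs can be reconstructed into counts (n = 10 sources per engine) and the two distributions compared with a Monte Carlo permutation test on the Pearson chi-square statistic — a minimal sketch of one plausible approach, not the authors' actual analysis:

```python
import random
from collections import Counter

# Counts reconstructed from the reported percentages for the 10
# numeric-response FAQs; the test choice below is an assumption.
google = (["medical practice"] * 4 + ["academic"] * 3
          + ["single-surgeon"] * 2 + ["social media"] * 1)
chatgpt = ["academic"] * 10

def chi2_stat(a, b):
    """Pearson chi-square statistic for two samples of category labels."""
    cats = set(a) | set(b)
    ca, cb = Counter(a), Counter(b)
    n_a, n_b, n = len(a), len(b), len(a) + len(b)
    stat = 0.0
    for c in cats:
        total = ca[c] + cb[c]
        for obs, size in ((ca[c], n_a), (cb[c], n_b)):
            exp = total * size / n  # expected count under independence
            stat += (obs - exp) ** 2 / exp
    return stat

def permutation_p(a, b, n_iter=20_000, seed=0):
    """Monte Carlo permutation p-value: how often does randomly
    reassigning sources to the two engines give a chi-square statistic
    at least as extreme as the observed one?"""
    rng = random.Random(seed)
    observed = chi2_stat(a, b)
    pooled = a + b
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        if chi2_stat(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    return (hits + 1) / (n_iter + 1)  # add-one to avoid p = 0

p = permutation_p(google, chatgpt)
print(f"permutation p-value: {p:.4f}")
```

With these reconstructed counts the permutation p-value comes out well below 0.05, consistent in direction with the paper's reported significance, though the exact value depends on the (unknown) test the authors used.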

Conclusion

ChatGPT-4 provided trustworthy academic sources for medical information retrieval concerning TSA, while sources used by Google were heterogeneous. Accuracy and clinical relevance of information were not significantly different between ChatGPT-4 and Google.

Level of Evidence

Level IV cross-sectional.

