Background: This study evaluates the use of large language models (LLMs) in generating Patient Education Materials (PEMs) for dentistry, focusing on their reliability, readability, understandability, and actionability. Four LLMs (ChatGPT-4.0, Claude 3.5 Sonnet, Gemini 1.5 Flash, and Llama 3.1-405b) were assessed on their ability to generate PEMs for four common dental scenarios.
Methods: A comparative analysis was conducted in which five independent dental professionals assessed the materials using the Patient Education Materials Assessment Tool (PEMAT) to evaluate understandability and actionability. Readability was measured with Flesch Reading Ease and Flesch-Kincaid Grade Level scores, and inter-rater reliability was assessed using Fleiss' Kappa.
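For reference, the standard formulations of these metrics are given below; the abstract does not specify the exact implementation the authors used, so these are the conventional definitions only.

\[
\text{Flesch Reading Ease} = 206.835 - 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right)
\]

\[
\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}
\]

where \(\bar{P}\) is the mean observed agreement across rated items and \(\bar{P}_e\) is the agreement expected by chance; higher \(\kappa\) indicates stronger inter-rater reliability.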
Results: Llama 3.1-405b demonstrated the highest inter-rater reliability (Fleiss' Kappa: 0.78-0.89). ChatGPT-4.0 excelled in understandability, surpassing the PEMAT threshold of 70% in three of the four scenarios. Claude 3.5 Sonnet performed well in understandability for two scenarios but did not consistently meet the 70% threshold for actionability. ChatGPT-4.0 generated the longest responses, while Claude 3.5 Sonnet produced the shortest.
Conclusions: ChatGPT-4.0 demonstrated superior understandability, while Llama 3.1-405b achieved the highest inter-rater reliability. The findings indicate that further refinement and human oversight are necessary for LLM-generated content to meet the standards of effective patient education.
