Introduction
This study evaluated the accuracy and consistency of two large language models developed by Alphabet Inc. when answering clinical questions about external cervical resorption: Google Gemini (GG), a base configuration, and NotebookLM (NLM), a document-grounded configuration that uses a retrieval-augmented framework.
Methods
Forty-six dichotomous clinical questions related to external cervical resorption were developed by three academic endodontists based on established sources. Each question was submitted to GG and NLM using three independent user accounts, yielding 276 responses. NLM implemented the retrieval-augmented generation configuration and was instructed to generate responses exclusively from the provided documents. Three endodontic experts independently evaluated all responses against predefined gold-standard answers. Accuracy was defined as agreement with the gold standard; consistency was defined as identical responses across the three trials. Statistical analyses included 95% confidence intervals (CIs) (Wald and Wilson), Fleiss' kappa, and Fisher's exact test.
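As an illustrative sketch (not part of the original analysis code, which is not described in the abstract), the Wilson score interval used for the accuracy estimates can be computed directly from the reported counts; the function name and the choice of z = 1.96 are assumptions for this example.

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion.

    successes: number of correct responses
    n: total number of questions
    z: critical value for the desired confidence level (1.96 for 95%)
    """
    p = successes / n
    denom = 1 + z**2 / n
    center = p + z**2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - margin) / denom, (center + margin) / denom

# GG accuracy: 41 of 46 questions answered correctly
lo, hi = wilson_ci(41, 46)
print(f"{lo:.4f}-{hi:.4f}")  # 0.7696-0.9527, matching the reported 76.96-95.27
```

The Wilson interval is preferred over the Wald interval for proportions near 0 or 1 with modest n, which is why both are reported in the Methods.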
Results
GG achieved an accuracy of 89% (41/46; 95% CI, 76.96-95.27) and a consistency rate of 93% (κ = 0.89; P < .001). NLM achieved an accuracy of 96% (44/46; 95% CI, 85.47-98.79) and an identical consistency rate of 93% (κ = 0.90; P < .001). Neither accuracy nor consistency differed significantly between the two large language models.
Conclusions
Both NLM and GG exhibited high accuracy and consistency. Although NLM performed slightly better, retrieval augmentation did not significantly improve responses to structured clinical tasks.
