Large language models such as ChatGPT have demonstrated significant potential for question answering in ophthalmology, but there is a paucity of literature evaluating their ability to generate clinical assessments and discussions. The objectives of this study were to (1) assess the accuracy of assessments and plans generated by ChatGPT and (2) evaluate ophthalmologists’ abilities to distinguish between responses generated by clinicians and those generated by ChatGPT.
Cross-sectional mixed-methods study.
Sixteen ophthalmologists from a single academic center, of whom 10 were board-eligible and 6 were board-certified, were recruited to participate in this study.
Prompt engineering was used to ensure that ChatGPT generated discussions in the style of the ophthalmologist author of the Medical College of Wisconsin Ophthalmic Case Studies. Cases in which ChatGPT accurately identified the primary diagnosis were included and then paired. Masked human-generated and ChatGPT-generated discussions were sent to participating ophthalmologists, who were asked to identify the author of each discussion. Response confidence was assessed using a 5-point Likert scale, and subjective feedback was manually reviewed.
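As a minimal sketch only, the discussion-generation step could be implemented with the OpenAI chat completions API along the lines below; the model name, prompt wording, and function names are illustrative assumptions and not the study’s actual prompt.

```python
# Minimal sketch of the discussion-generation step, assuming the OpenAI Python
# client (openai>=1.0). The model name, prompt wording, and style exemplar are
# illustrative placeholders, not the study's actual prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_discussion(case_text: str, style_exemplar: str) -> str:
    """Request a case discussion written to mimic a given author's style."""
    response = client.chat.completions.create(
        model="gpt-4",  # assumed; the abstract does not state the model version
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an academic ophthalmologist. Write the assessment and "
                    "discussion for the case provided by the user, matching the tone "
                    "and structure of this example discussion:\n\n" + style_exemplar
                ),
            },
            {"role": "user", "content": case_text},
        ],
    )
    return response.choices[0].message.content
```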
Accuracy of ophthalmologists’ identification of the discussion author, as well as subjective perceptions of human-generated versus ChatGPT-generated discussions.
Overall, ChatGPT correctly identified the primary diagnosis in 15 of 17 (88.2%) cases. Two cases were excluded from the paired comparison because of hallucinations or fabrication of data not provided by the user. Ophthalmologists correctly identified the author in 77.9% ± 26.6% of the 13 included cases, with a mean Likert scale confidence rating of 3.6 ± 1.0. No significant differences in performance or confidence were found between board-certified and board-eligible ophthalmologists. Subjectively, ophthalmologists found that discussions written by ChatGPT tended to contain more generic responses and irrelevant information, hallucinated more frequently, and exhibited distinct syntactic patterns (all P < 0.01).
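Purely as an illustration, a between-group comparison like the one reported above could be run with a nonparametric test; the abstract does not state which test was used, so the Mann-Whitney U test and the per-rater values below are assumptions, not study data.

```python
# Illustrative only: the test choice and the per-rater accuracy values are
# placeholders, not study data.
from scipy.stats import mannwhitneyu

board_certified = [0.85, 0.92, 0.77, 0.69, 0.85, 0.92]                          # 6 raters (placeholder)
board_eligible = [0.62, 0.77, 0.85, 0.54, 0.92, 0.77, 0.69, 0.85, 0.62, 0.92]   # 10 raters (placeholder)

stat, p_value = mannwhitneyu(board_certified, board_eligible, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.1f}, P = {p_value:.3f}")
```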
Large language models have the potential to synthesize clinical data and generate ophthalmic discussions. While these findings have exciting implications for artificial intelligence-assisted health care delivery, more rigorous real-world evaluation of these models is necessary before clinical deployment.
The author(s) have no proprietary or commercial interest in any materials discussed in this article.
To evaluate the performance of a large language model (LLM) in classifying electronic health record (EHR) text, and to use this classification to evaluate the type and resolution of hemorrhagic events (HEs) after microinvasive glaucoma surgery (MIGS).
Retrospective cohort study.
Eyes from the Bascom Palmer Glaucoma Repository.
Eyes that underwent MIGS between July 1, 2014, and February 1, 2022, were analyzed. Chat Generative Pre-trained Transformer (ChatGPT) was used to classify deidentified EHR anterior chamber examination text into HE categories (no hyphema, microhyphema, clot, and hyphema). Agreement between classifications by ChatGPT and a glaucoma specialist was evaluated using Cohen’s kappa and a precision-recall (PR) curve. Time to resolution of HEs was assessed using Cox proportional-hazards models. Goniotomy HE resolution was evaluated by degree of angle treatment (90°–179°, 180°–269°, 270°–360°). Logistic regression was used to identify HE risk factors.
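One way the classification and agreement steps could be sketched, assuming the OpenAI chat completions API and scikit-learn, is shown below; the prompt, model name, and label lists are illustrative assumptions rather than the study’s exact pipeline, and the PR-curve analysis (which additionally requires binarized labels or scores) is omitted.

```python
# Sketch of LLM-based HE classification and agreement analysis. The prompt,
# model name, and labels below are placeholders, not the study's pipeline.
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()
CATEGORIES = ["no hyphema", "microhyphema", "clot", "hyphema"]

def classify_ac_exam(exam_text: str) -> str:
    """Map one deidentified anterior chamber examination string to an HE category."""
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model version
        messages=[
            {"role": "system",
             "content": ("Classify the anterior chamber examination into exactly one of: "
                         + ", ".join(CATEGORIES) + ". Reply with the category only.")},
            {"role": "user", "content": exam_text},
        ],
    )
    return response.choices[0].message.content.strip().lower()

# Agreement with the glaucoma specialist's labels (placeholder values shown):
specialist = ["hyphema", "no hyphema", "microhyphema", "clot", "no hyphema"]
llm_output = ["hyphema", "no hyphema", "microhyphema", "hyphema", "no hyphema"]
print(cohen_kappa_score(specialist, llm_output))
```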
Accuracy of ChatGPT HE classification and incidence and resolution of HEs.
The study included 434 goniotomy eyes (368 patients) and 528 Schlemm’s canal stent (SCS) eyes (390 patients). ChatGPT demonstrated excellent HE classification agreement (Cohen’s kappa 0.93; area under the PR curve 0.968). Using ChatGPT classifications, at postoperative day 1, HEs occurred in 67.8% of goniotomy eyes and 25.2% of SCS eyes (P < 0.001). The 270° to 360° goniotomy group had the highest HE rate (84.0%, P < 0.001). At postoperative week 1, HEs were observed in 43.4% and 11.3% of goniotomy and SCS eyes, respectively (P < 0.001). By postoperative month 1, HE rates were 13.3% and 1.3% among goniotomy and SCS eyes, respectively (P < 0.001). Time to HE resolution differed between the goniotomy angle groups (log-rank P = 0.034); median time to resolution was 10, 10, and 15 days for the 90° to 179°, 180° to 269°, and 270° to 360° groups, respectively. Risk factor analysis demonstrated that greater goniotomy angle was the only significant predictor of HEs (odds ratio for 270°–360°: 4.08; P < 0.001).
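For illustration, the time-to-resolution and risk-factor analyses reported above could be approximated with lifelines and statsmodels as in the sketch below; the dataframe, column names, and values are hypothetical placeholders, not repository data.

```python
# Hypothetical sketch of the log-rank and logistic-regression analyses; all data
# and column names are placeholders, not Bascom Palmer Glaucoma Repository data.
import pandas as pd
import statsmodels.formula.api as smf
from lifelines.statistics import multivariate_logrank_test

df = pd.DataFrame({
    "angle_group": ["90-179"] * 4 + ["180-269"] * 4 + ["270-360"] * 4,
    "he_any": [1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0],                 # HE at postoperative day 1
    "days_to_resolution": [10, 5, 7, 12, 10, 9, 14, 8, 15, 16, 13, 11],
    "resolved": [1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0],               # 1 = resolution observed
})

# Log-rank test for time to HE resolution across goniotomy angle groups.
logrank = multivariate_logrank_test(df["days_to_resolution"], df["angle_group"], df["resolved"])
print(f"log-rank P = {logrank.p_value:.3f}")

# Logistic regression with goniotomy angle group as the HE risk factor.
model = smf.logit("he_any ~ C(angle_group)", data=df).fit(disp=False)
print(model.summary())
```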
Large language models can be used effectively to classify longitudinal EHR free-text examination data with high accuracy, highlighting a promising direction for future LLM-assisted research and clinical decision support. Hemorrhagic events are relatively common, self-resolving complications that occur more often in goniotomy cases and with larger goniotomy treatments. Time to HE resolution differs significantly between goniotomy angle groups.
Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.