Background: The Ventral Rectopexy International Expert Panel recently published a consensus update. Prior to its publication, the ability of large language models to synthesize the ventral rectopexy literature without an explicit knowledge base was studied.
Objective: To compare different large language models' responses and citations on ventral rectopexy using the expert panel consensus as a reference standard.
Design: ChatGPT-4o, Gemini 1.5 Pro, and OpenEvidence were compared on content appropriateness (rated 1, inappropriate, to 5, appropriate), readability (Flesch reading ease), response length, citation fabrication rate, and citation quality per the Oxford Levels of Evidence. The chatbot response rated most content-appropriate by the expert panel was de-identified and presented alongside the consensus text to 15 colorectal surgeons, who attempted to identify the chatbot-generated text.
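For reference, readability was scored with the Flesch reading ease metric, whose standard formula is the following (higher scores indicate easier text):

$$\mathrm{FRE} = 206.835 - 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right)$$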
Setting: Questions were submitted on September 18-19, 2024. Analysis was performed prior to online publication of the consensus on January 30, 2025.
Main outcome measures: Content appropriateness, fabricated citation rate, citation quality, and colorectal surgeons' accuracy in distinguishing human- from chatbot-generated text.
Results: OpenEvidence ranked highest for content appropriateness (mean 3.5/5), above Gemini (3.0/5) and ChatGPT (2.8/5) (p<0.001). ChatGPT produced the longest responses and the highest Flesch reading ease scores (p=0.021). ChatGPT fabricated 53% of citations, Gemini 12%, and OpenEvidence 0% (p<0.001). All OpenEvidence citations were peer-reviewed, with 40/117 (34%) citing Level I-III studies vs. 46/94 (49%) of the references cited in the consensus (p=0.043). Surgeons identified the chatbot-generated responses with 28/51 (55%) accuracy.
Limitations: Results may not be reproducible, given the stochastic nature of large language models and the availability of the consensus after its online publication.
Conclusions: OpenEvidence outperformed Gemini 1.5 Pro and ChatGPT-4o in content appropriateness and in the quantity and quality of peer-reviewed citations. Chatbot-generated text was indistinguishable from the expert-authored consensus, even to subject matter experts. Large language models may be viable as an early-stage research tool for future consensus working groups, provided there is transparent disclosure and rigorous oversight. See Video Abstract.