Introduction: Artificial intelligence (AI) chatbots, powered by large language models, are increasingly used to disseminate surgical information, but concerns about accuracy, hallucinations and source reliability persist. This study evaluates the sources on which these systems rely when generating medical information. As these models generate language without true comprehension or reasoning, assessing the credibility and nature of their referenced sources is essential to promote transparency and support evidence-based integration of AI in healthcare.
Methods: Nine AI chatbots (ChatGPT-5, ChatGPT-5 Think, DeepSeek R1, DeepSeek DeepThink, Google Gemini 2.5 Flash, Grok 3, Grok 4, Perplexity Research and Perplexity Search) were queried with six standardised general surgery prompts, both with and without explicit requests for references (n=108 outputs); 1,249 references were extracted and assessed for quantity, authenticity, quality, source category, accessibility, geographic origin and attribution.
Results: Reference provision varied: four chatbots provided references only when explicitly prompted, whereas the others cited sources consistently. Hallucination rates ranged from 0% (five models) to 34% (Grok 3). Mean quality scores differed significantly, with Perplexity Research achieving the highest score (4.08) and ChatGPT-5 the lowest (2.39), reflecting differences in source type. Most references originated from the US or UK. Accessibility was best in Google Gemini (100% open access, clickable citations). Explicit prompting significantly increased reference quantity in six models and reference quality in one.
Conclusions: AI chatbots exhibit heterogeneous reference integrity, with risks of hallucination and bias underscoring the need for prompt engineering, model refinement and ongoing evaluation. Our findings suggest continued caution is required in surgical contexts to ensure safe, equitable information dissemination.
