Purpose
To compare ChatGPT, a general-purpose large language model (LLM), with OpenBioLLM, a domain-specific biomedical model, in generating clinically appropriate anesthesia plans, and to assess the impact of advanced prompt engineering (PE).
Methods
In this comparative observational study, anonymized clinical records of 100 cardiac surgery patients were analyzed using three LLMs (GPT-3.5, GPT-4o, and OpenBioLLM) under both simple querying and advanced PE. Generated plans were evaluated by experienced and trainee anesthesiologists in a double-blind design. The main outcome measures, namely clinical alignment, reasoning quality, medication selection, dosage accuracy, omissions, and risk of harm, were rated on a 5-point Likert scale, and differences were analyzed with paired t-tests and analysis of variance (ANOVA).
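A minimal sketch of the statistical comparison described above, assuming hypothetical Likert ratings; all scores, group means, and variable names below are illustrative placeholders, not the study's actual data.

```python
# Illustrative sketch of the analysis described above: a paired t-test
# comparing prompting conditions and a one-way ANOVA across the three models.
# All ratings here are randomly generated stand-ins for the study's data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical 5-point Likert ratings: one score per plan (100 patients)
# under each prompting condition for the same model.
simple_query = rng.integers(2, 5, size=100).astype(float)               # simple querying
prompt_eng = np.clip(simple_query + rng.normal(0.5, 0.7, 100), 1, 5)    # with advanced PE

# Paired t-test: the same plans are rated under both prompting conditions.
t_stat, p_paired = stats.ttest_rel(prompt_eng, simple_query)
print(f"paired t-test: t={t_stat:.2f}, p={p_paired:.4f}")

# One-way ANOVA across the three models (hypothetical score distributions).
gpt35 = rng.normal(3.5, 0.6, 100)
gpt4o = rng.normal(4.2, 0.5, 100)
openbio = rng.normal(3.0, 0.7, 100)
f_stat, p_anova = stats.f_oneway(gpt35, gpt4o, openbio)
print(f"ANOVA: F={f_stat:.2f}, p={p_anova:.4f}")
```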
Results
ChatGPT consistently outperformed OpenBioLLM on clinical alignment, logical reasoning, medication selection, and safety metrics. GPT-4o outperformed GPT-3.5, and prompt optimization yielded significant performance gains. Both physician groups reported notable improvements in mean scores for the ChatGPT models under advanced PE, whereas OpenBioLLM showed comparatively smaller gains.
Conclusion
ChatGPT’s adaptability and clinical accuracy, enhanced by PE, make it a valuable tool for anesthesia planning, especially in resource-limited settings. However, over-reliance on AI by less experienced clinicians poses risks, underscoring the need for physician oversight and tailored training. Further research is necessary to validate these findings across diverse clinical scenarios.