Objectives: To systematically evaluate the performance of generative artificial intelligence (GenAI) models, DeepSeek-V3 and the Qwen3 series, in the differential diagnosis of weight loss.
Methods: A search was conducted in the PubMed database for all case reports published in the American Journal of Case Reports between January 1, 2012 and June 2, 2025, containing the term "weight loss" in the title or abstract. Two senior general practitioners independently verified and assessed whether each case met the diagnostic criteria for weight loss (emaciation). Cases that did not meet these criteria, had incomplete information, or fell within the scope of clearly defined specialized diagnoses and treatments were excluded. The remaining cases were then compiled into standardized clinical case summaries. These summaries were presented to DeepSeek-V3 and the Qwen3 series models (Qwen3-235B-A22B, Qwen3-30B-A3B, and Qwen3-32B) to generate ranked lists of the top 10 differential diagnoses. The models were not specifically fine-tuned for this task. Sensitivity, precision, and F1-score were used to evaluate performance. Intergroup comparisons were performed using McNemar's test and Cochran's Q test.
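For readers unfamiliar with McNemar's test, the paired comparison of two models on the same cases can be sketched as follows. This is a minimal illustration using the exact (binomial) form of the test; the abstract does not say whether the exact or the chi-square form was used. The function name and the discordant counts in the example are hypothetical and are not taken from the study.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: cases where model A hit the diagnosis and model B missed it
    c: cases where model B hit the diagnosis and model A missed it
    Concordant cases (both hit or both miss) do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    smaller = min(b, c)
    # Under the null, discordant pairs split 50/50: Binomial(n, 0.5)
    tail = sum(comb(n, i) for i in range(smaller + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, for illustration only
p_value = mcnemar_exact(8, 2)
print(f"p = {p_value:.4f}")
```

Only the discordant pairs matter here, which is why two models with similar overall hit rates can still differ significantly if their disagreements are one-sided.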
Results: A total of 87 cases were analyzed. For DeepSeek-V3, the sensitivity for Top1, Top5, and Top10 diagnoses was 26.44%, 56.32%, and 65.52%, respectively, with corresponding precision values of 26.44%, 11.26%, and 6.55%. For Qwen3-235B-A22B, the sensitivity values were 21.84%, 43.68%, and 59.77%, with corresponding precision values of 21.84%, 8.74%, and 5.98%. DeepSeek-V3 performed significantly better than Qwen3-235B-A22B in sensitivity, precision, and F1-score at the Top5 level (P=0.043). Among the Qwen3 series models, Qwen3-235B-A22B showed the best sensitivity, precision, and F1-score for the Top1 diagnosis, outperforming Qwen3-32B and Qwen3-30B-A3B. However, the differences among the three Qwen3 models were not statistically significant at any diagnostic level (all P>0.05).
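The relationship between the reported sensitivity and precision figures can be illustrated with a short sketch. It assumes one reference diagnosis per case, so that Top-k sensitivity is hits divided by cases and Top-k precision is hits divided by the total number of diagnoses listed (cases times k). This definition is inferred from the reported percentages rather than stated in the abstract. The hit counts below are back-calculated from those percentages and the 87 cases, and the function name is hypothetical.

```python
def top_k_metrics(hits: int, n_cases: int, k: int) -> dict:
    """Top-k sensitivity, precision, and F1 under a one-correct-diagnosis-per-case assumption."""
    sensitivity = hits / n_cases        # share of cases whose diagnosis appears in the top k
    precision = hits / (n_cases * k)    # share of all listed diagnoses that are correct
    if hits == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "precision": precision, "f1": f1}


N_CASES = 87
# Hit counts back-calculated from the reported DeepSeek-V3 percentages (an assumption)
DEEPSEEK_V3_HITS = {1: 23, 5: 49, 10: 57}

for k, hits in DEEPSEEK_V3_HITS.items():
    m = top_k_metrics(hits, N_CASES, k)
    print(f"Top{k}: sensitivity={m['sensitivity']:.2%}, "
          f"precision={m['precision']:.2%}, F1={m['f1']:.2%}")
```

Under this definition, precision necessarily falls as k grows because the denominator scales with the list length, which is consistent with the pattern in the reported figures.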
Conclusions: The Chinese-developed GenAI models evaluated here favor "breadth over precision" in the differential diagnosis of weight loss, with DeepSeek-V3 performing better at key diagnostic levels. Although the sensitivity and precision for the top-ranked diagnosis require improvement, these models can serve as effective clinical decision support tools, broadening the diagnostic perspectives of general practitioners. They may hold significant application value in the management of undifferentiated diseases.
