Objectives: To explore the complexities of eliminating race correction in clinical artificial intelligence (AI) and the pitfalls of naive solutions, and to propose systematic strategies for equitable model development.
Background and significance: Race correction in clinical AI, as in traditional medicine, introduces biases with potentially harmful consequences. Simple removal of race from models is insufficient due to the lasting influence of historically biased data.
Approach: We analyze 4 standardized scenarios to demonstrate how race correction manifests in clinical AI: use of race-corrected variables, explicit inclusion of race, inference via proxy variables, and use of race-specific models.
Results: For each scenario, the intuitive approach of removing race correction fails to eliminate bias, often because of legacy effects embedded in the data. More thoughtful approaches are required.
Discussion: Ending race correction in clinical AI requires deliberate, context-sensitive interventions, inclusion of diverse stakeholders, and strategies to make model reasoning more transparent and auditable.
Objectives: Accurate triage in emergency departments (EDs) is critical for appropriate resource allocation. While artificial intelligence (AI) has been explored for triage, prior models have relied on summarized clinical scenarios. We aimed to develop and evaluate large language models (LLMs) trained on real-world clinical conversations to classify patient urgency.
Materials and methods: We used a nationally curated dataset of anonymized triage-level conversations from 3 tertiary Korean hospitals. Two BERT-based models were developed to classify urgency according to the Korean Triage and Acuity Scale (KTAS) as urgent (KTAS 3) or non-urgent (KTAS 4-5). One model tokenized the entire conversation, while the other applied a hierarchical structure with sentence-level tokenization and speaker-role embeddings. Performance metrics included accuracy, precision, recall, and F1-score. We compared our models against ChatGPT (GPT-4o) and ClinicalBERT, and assessed explainability using SHapley Additive exPlanations (SHAP).
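The abstract does not give implementation details; the sketch below illustrates, in PyTorch, what a hierarchical architecture with sentence-level tokenization and speaker-role embeddings might look like. The Korean BERT checkpoint, layer sizes, role vocabulary, and pooling choice are assumptions for illustration, not the authors' code.

```python
# Minimal sketch (not the authors' implementation): each utterance is encoded
# separately with a BERT encoder, a learned speaker-role embedding is added to
# each utterance vector, and a small Transformer aggregates utterances before
# a binary urgency head (urgent vs non-urgent).
import torch
import torch.nn as nn
from transformers import AutoModel

class HierarchicalTriageClassifier(nn.Module):
    def __init__(self, encoder_name="klue/bert-base", n_roles=2, n_classes=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)   # sentence-level BERT (assumed checkpoint)
        hidden = self.encoder.config.hidden_size
        self.role_emb = nn.Embedding(n_roles, hidden)             # speaker-role embedding (e.g., nurse vs patient)
        agg_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.aggregator = nn.TransformerEncoder(agg_layer, num_layers=2)
        self.classifier = nn.Linear(hidden, n_classes)             # urgency head

    def forward(self, input_ids, attention_mask, role_ids):
        # input_ids, attention_mask: (batch, n_utterances, seq_len); role_ids: (batch, n_utterances)
        b, n, s = input_ids.shape
        out = self.encoder(input_ids.view(b * n, s),
                           attention_mask=attention_mask.view(b * n, s))
        utt_vecs = out.last_hidden_state[:, 0].view(b, n, -1)      # [CLS] vector per utterance
        utt_vecs = utt_vecs + self.role_emb(role_ids)              # inject speaker role
        conv_vec = self.aggregator(utt_vecs).mean(dim=1)           # pool over the conversation
        return self.classifier(conv_vec)
```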
Results: Of a total of 5244 clinical conversations, 1057 triage-level dialogues were used, with 950 for training and 107 for testing. Our model with hierarchical structure achieved an accuracy of 75.94%, significantly outperforming ChatGPT (56.68%) and fine-tuned ClinicalBERT (69.42%). For urgent cases, the best model achieved a recall of 0.9610, outperforming ChatGPT (0.5352). SHAP analysis confirmed that our model focused on clinically relevant cues aligned with KTAS criteria.
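For context, token-level SHAP attributions for a fine-tuned Hugging Face text classifier are typically produced as in the sketch below; the model path and example utterance are placeholders, and this is not the authors' pipeline.

```python
# Minimal sketch of SHAP token attributions for a text classifier.
import shap
from transformers import pipeline

# Placeholder path to a fine-tuned triage classifier (assumption, not the study's model).
clf = pipeline("text-classification", model="path/to/fine-tuned-triage-bert", top_k=None)

explainer = shap.Explainer(clf)
# Hypothetical transcribed utterance, in English for illustration only.
shap_values = explainer(["Patient: I have had severe chest pain and shortness of breath since this morning."])
shap.plots.text(shap_values)  # highlights which tokens pushed the prediction toward the urgent class
```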
Conclusion: BERT-based LLMs trained on real-world ED conversations significantly outperform general-purpose models like ChatGPT in triage accuracy. This approach demonstrates the potential for enhancing clinical decision support with interpretable and efficient AI.

