Sociodemographic factors are critical determinants of health outcomes and disparities, yet their documentation in electronic medical records (EMRs) is often sparse and confined to unstructured clinical text. This poses substantial challenges for automated extraction and integration into clinical decision-making. In this study, we systematically evaluated and compared 6 convolutional neural network (CNN) architectures, including hybrid models that integrate traditional classifiers, for binary classification of multiple sociodemographic characteristics from EMR text, using data from 4375 patients across 96 primary care clinics. The goal was to assess how model complexity and lexical diversity influence classification performance. Manual annotation achieved high inter-rater reliability (kappa: 0.98 for documentation status, 0.96 for documented information). We report performance using F1 score, precision, recall, area under the precision-recall curve, and Matthews correlation coefficient. Simpler architectures, particularly a single-layer CNN, consistently outperformed deeper or hybrid models across most characteristics (F1 score: 90.99%), especially under conditions of data imbalance and varied documentation patterns. While hybrid models offered gains for well-documented factors such as marital status, they were less effective for sparsely documented or lexically diverse characteristics. These findings provide a practical framework for developing efficient, interpretable clinical NLP pipelines and inform model selection strategies for real-world health equity and EMR research applications.
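
For readers less familiar with the reported metrics, the following minimal sketch shows how precision, recall, F1 score, and the Matthews correlation coefficient follow from the four confusion-matrix counts of a binary classifier. It is illustrative only and is not the evaluation code used in the study; the function name `binary_metrics` and the toy labels are assumptions introduced here, and the area under the precision-recall curve is omitted because it requires predicted scores rather than hard labels.

```python
import math


def binary_metrics(y_true, y_pred):
    """Return precision, recall, F1 and MCC for binary labels (1 = positive).

    Hypothetical helper for illustration; undefined ratios are reported as 0.0.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0

    # MCC uses all four cells, so it stays informative under class imbalance.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


if __name__ == "__main__":
    # Toy labels (assumed): 1 = characteristic documented, 0 = not documented.
    y_true = [1, 1, 0, 0]
    y_pred = [1, 0, 0, 0]
    print(binary_metrics(y_true, y_pred))
```

Because the Matthews correlation coefficient accounts for true negatives as well as the positive class, it complements F1 score when a characteristic is documented for only a small share of patients.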
