Tuberculosis (TB) remains one of the leading causes of infectious mortality. Diagnosis is challenging because imaging and clinical data must be interpreted in tandem: chest X-rays and computed tomography scans support screening but require expert readers, while clinical variables such as age, body mass index (BMI), symptoms, HIV status, and socio-economic factors capture risk information absent from images. Appending clinical scores to radiologic outputs has raised the area under the receiver-operating-characteristic curve (AUC) from roughly 0.72–0.78 to 0.84, but conventional convolutional networks cannot fully model interactions between modalities. In this paper, we propose a hybrid deep learning model that jointly learns from chest imaging and structured clinical data using a cross-modal transformer with contrastive pre-training. A vision transformer represents chest images as sequences of patch embeddings, while a tabular transformer processes cleaned and normalized clinical variables. A cross-modal attention module enables the image tokens to attend to clinical tokens, and a contrastive pre-training objective encourages radiological and clinical representations from the same patient to align while separating those from different patients, yielding joint radiological-clinical representations that improve both the accuracy and the interpretability of diagnosis. Evaluated on two open-access cohorts (TB Portals and Integrated Mycobacterial CT), the system attained an AUC of 0.96 and an accuracy of 0.90, approximately 10% higher than CNN-based baselines. Attention and SHAP analyses showed that variables such as HIV status and low BMI direct the model's attention to clinically relevant lung regions. Combining radiological and clinical data in this way improves diagnostic performance and offers a viable pathway for TB diagnosis in resource-limited settings.
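To make the fusion mechanism concrete, the sketch below illustrates the two components named above: a cross-modal attention block in which image patch tokens attend to clinical tokens, and an InfoNCE-style contrastive objective that aligns same-patient image and clinical embeddings. All module names, dimensions, and hyperparameters (e.g., `CrossModalBlock`, `embed_dim=256`, `temperature=0.07`) are illustrative assumptions, not the paper's reported configuration.

```python
# Minimal PyTorch sketch of cross-modal attention fusion and
# contrastive pre-training; sizes and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalBlock(nn.Module):
    """Image patch tokens (queries) attend to clinical tokens (keys/values)."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, img_tokens, clin_tokens):
        q = self.norm_q(img_tokens)      # (B, N_img, D)
        kv = self.norm_kv(clin_tokens)   # (B, N_clin, D)
        fused, attn_weights = self.attn(q, kv, kv)
        return img_tokens + fused, attn_weights  # residual connection

def contrastive_loss(img_emb, clin_emb, temperature: float = 0.07):
    """InfoNCE-style loss: same-patient image/clinical embeddings align,
    different-patient pairs are pushed apart."""
    img_emb = F.normalize(img_emb, dim=-1)
    clin_emb = F.normalize(clin_emb, dim=-1)
    logits = img_emb @ clin_emb.t() / temperature    # (B, B) similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Symmetric cross-entropy over both matching directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Toy usage: 8 patients, 196 image patches, 12 clinical tokens, dim 256.
B, N_img, N_clin, D = 8, 196, 12, 256
block = CrossModalBlock(D)
img_tokens = torch.randn(B, N_img, D)
clin_tokens = torch.randn(B, N_clin, D)
fused, weights = block(img_tokens, clin_tokens)
loss = contrastive_loss(fused.mean(dim=1), clin_tokens.mean(dim=1))
```

The returned `attn_weights` map each image patch to the clinical variables it attends to, which is one plausible route to the attention-based interpretability analyses mentioned above.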