Purpose: Differentiating pulmonary lymphoma from lung infections on CT images is challenging. Existing deep neural network models for lung CT classification rely on 2D slices, which lack comprehensive information and must be selected manually. 3D models that split the volume into chunks compromise image information and struggle to reduce their parameter count, which limits performance. These limitations must be addressed to improve accuracy and practicality.
Methods: Inspired by the clinical practice of reading a sequence of cross-sectional slices for diagnosis, we propose a transformer-based sequential feature encoding structure that integrates multi-level information from complete CT images. We incorporate position encoding and cross-level long-range information fusion modules into the CNN that extracts features from the cross-sectional slices, to ensure high-precision feature extraction.
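The slice-sequence idea described above can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the class name `SliceSequenceClassifier`, the tiny stand-in CNN, the learned position embedding, the mean pooling over slices and all layer sizes are assumptions chosen for brevity. The cross-level long-range fusion modules inside the CNN are not modelled here.

```python
import torch
import torch.nn as nn


class SliceSequenceClassifier(nn.Module):
    """Hypothetical sketch: per-slice CNN features -> position encoding -> transformer -> logits."""

    def __init__(self, max_slices=64, dim=32, num_classes=2):
        super().__init__()
        # Tiny stand-in CNN applied to every cross-sectional slice independently (assumption).
        self.slice_cnn = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # (N, 8, 1, 1)
            nn.Flatten(),             # (N, 8)
            nn.Linear(8, dim),        # (N, dim)
        )
        # Learned position embedding, one vector per slice index (assumption).
        self.pos_embed = nn.Parameter(torch.zeros(1, max_slices, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, dim_feedforward=64, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, volume):
        # volume: (batch, slices, height, width), one channel per slice
        b, s, h, w = volume.shape
        feats = self.slice_cnn(volume.reshape(b * s, 1, h, w))   # (b*s, dim)
        feats = feats.reshape(b, s, -1) + self.pos_embed[:, :s]  # add slice positions
        fused = self.encoder(feats)          # self-attention across the slice sequence
        return self.head(fused.mean(dim=1))  # average over slices, then classify


model = SliceSequenceClassifier()
model.eval()
with torch.no_grad():
    # Two synthetic volumes of 16 slices, 64x64 pixels each.
    logits = model(torch.randn(2, 16, 64, 64))
```

The sketch processes the whole slice sequence at once, so no manual slice selection or volume chunking is needed; the position embedding is what lets the transformer distinguish slice order.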
Results: We conducted comprehensive experiments on a dataset of 124 patients, split into 64, 20 and 40 patients for training, validation and testing, respectively. Ablation and comparative experiments demonstrated the effectiveness of our approach. Our method outperforms existing state-of-the-art methods on the 3D CT image classification task of distinguishing between lung infections and pulmonary lymphoma, achieving an accuracy of 0.875, an AUC of 0.953 and an F1 score of 0.889.
Conclusion: The experiments verified that our proposed position-enhanced transformer-based sequential feature encoding model effectively performs high-precision feature extraction and contextual feature fusion on lung CT images. It enhances the feature extraction ability of a standalone CNN or transformer, thereby improving classification performance. The source code is available at https://github.com/imchuyu/PTSFE .