Background: This study uses machine learning algorithms to screen large-scale biomedical gene expression data for the optimal diagnostic gene signature of pulmonary arterial hypertension (PAH).
Methods: Datasets comprising 32 normal controls and 37 PAH samples were obtained from the public Gene Expression Omnibus (GEO) database. Differentially expressed genes were selected and then subjected to enrichment analysis. Two machine learning methods, the least absolute shrinkage and selection operator (LASSO) and support vector machine (SVM), were used to identify candidate genes. The expression levels and diagnostic value of the candidate genes were further tested in an external validation dataset. Diagnostic performance was evaluated using receiver operating characteristic (ROC) curves. The deconvolution tool CIBERSORT was used to estimate the composition of immune cell subtypes and to perform correlation analysis on the combined training dataset.
Results: A total of 564 differentially expressed genes (DEGs) were identified between normal control and PAH samples. Enrichment analysis showed that these genes were closely associated with cardiovascular diseases, inflammatory diseases, and immune-related pathways. Using 5-fold cross-validation, the LASSO and SVM algorithms identified 9 and 7 characteristic genes, respectively. Both algorithms selected caldesmon 1 (CALD1) and solute carrier family 7 member 11 (SLC7A11), marking them as gene signatures highly correlated with PAH. The area under the ROC curve (AUC) was 0.924 for CALD1 and 0.962 for SLC7A11, indicating that both genes have high diagnostic value.
Conclusion: CALD1 and SLC7A11 can serve as diagnostic markers of PAH and provide new insights for further study of the immune mechanisms involved in PAH.
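As an illustration of the ROC evaluation described in the Methods, the AUC of a single gene can be computed directly from its expression values and the sample labels. The following is a minimal sketch, not the authors' pipeline: the function name `single_gene_auc` and the toy expression values are hypothetical, and the calculation assumes the pairwise (Mann-Whitney) formulation of the AUC with label 1 for PAH and 0 for control.

```python
from typing import Sequence


def single_gene_auc(expression: Sequence[float], labels: Sequence[int]) -> float:
    """AUC of one gene's expression used as a classifier score.

    Labels: 1 = PAH, 0 = control (an assumed encoding).
    Uses the Mann-Whitney formulation: the fraction of (case, control)
    pairs in which the case has the higher expression, with ties
    counted as one half.
    """
    cases = [x for x, y in zip(expression, labels) if y == 1]
    controls = [x for x, y in zip(expression, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")

    wins = 0.0
    for case_value in cases:
        for control_value in controls:
            if case_value > control_value:
                wins += 1.0
            elif case_value == control_value:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Toy, made-up expression values for illustration only (not study data).
toy_expression = [2.1, 2.4, 3.0, 3.3, 3.1, 3.9, 4.2, 2.9]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_auc = single_gene_auc(toy_expression, toy_labels)
print(f"toy AUC = {toy_auc:.4f}")
```

An AUC near 1 means the gene's expression is consistently higher in PAH samples than in controls. For a gene that is down-regulated in disease, this function returns a value below 0.5, and the corresponding diagnostic AUC is one minus that value.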