Fernando García-García, Dae-Jin Lee, Mónica Nieves-Ermecheo, Olaia Bronte, Pedro Pablo España, José María Quintana, Rosario Menéndez, Antoni Torres, Luis Alberto Ruiz Iturriaga, Isabel Urrutia
Obtaining patient phenotypes in SARS-CoV-2 pneumonia, and their association with clinical severity and mortality.
Pneumonia 16(1), 12. Published 2024-06-25. DOI: 10.1186/s41479-024-00132-0 (https://doi.org/10.1186/s41479-024-00132-0)
Abstract
Background: The literature consistently reports wide heterogeneity in the clinical evolution of patients with COVID-19. Identifying the specific phenotypes underlying this population might contribute to a better understanding and characterization of the different courses of the disease. The aim of this study was to identify distinct clinical phenotypes among hospitalized patients with SARS-CoV-2 pneumonia using machine learning clustering, and to study their association with subsequent clinical outcomes, namely severity and mortality.
Methods: Multicentric, observational, prospective, longitudinal cohort study conducted in four hospitals in Spain. We included adult patients admitted to hospital due to SARS-CoV-2 pneumonia. We collected a broad spectrum of variables to describe each case exhaustively: patient demographics, comorbidities, symptoms, physiological status, baseline examinations (blood analytics, arterial gas test), etc. For the development and internal validation of the clustering/phenotype models, the dataset was split into training and test sets (50% each). We proposed a sequence of machine learning stages: feature scaling, missing data imputation, reduction of data dimensionality via Kernel Principal Component Analysis (KPCA), and clustering with the k-means algorithm. The optimal cluster model parameters (including k, the number of phenotypes) were chosen automatically by maximizing the average Silhouette score across the training set.
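The sequence of stages listed in the Methods can be sketched with scikit-learn as follows. This is a minimal illustration, not the authors' implementation: the median imputation, the RBF kernel, the number of KPCA components, the candidate values of k and the function name `fit_phenotype_model` are all assumptions, since the abstract does not state them. The sketch also searches only over k, whereas the study optimized further model parameters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import KernelPCA
from sklearn.impute import SimpleImputer
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_phenotype_model(X_train, k_values=(2, 3, 4, 5), n_components=5, seed=0):
    """Scale, impute, reduce with KPCA, then choose k by mean Silhouette score.

    Hypothetical helper: every default below is an assumption, not a value
    reported in the abstract.
    """
    # Stages in the order given in the Methods. StandardScaler ignores NaNs
    # when fitting and passes them through, so the imputer can follow it.
    preprocess = make_pipeline(
        StandardScaler(),
        SimpleImputer(strategy="median"),                    # assumed imputation method
        KernelPCA(n_components=n_components, kernel="rbf"),  # assumed kernel
    )
    Z_train = preprocess.fit_transform(X_train)

    best = None
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(Z_train)
        # Average Silhouette score over the training set, as in the Methods.
        score = silhouette_score(Z_train, km.labels_)
        if best is None or score > best[0]:
            best = (score, k, km)

    score, k, km = best
    return preprocess, km, k, score


if __name__ == "__main__":
    # Synthetic data only: three Gaussian blobs with 5% of values missing.
    rng = np.random.default_rng(0)
    X = np.vstack([c + rng.normal(size=(100, 6)) for c in (-6.0, 0.0, 6.0)])
    X[rng.random(X.shape) < 0.05] = np.nan

    # 50% / 50% split, mirroring the training and test sets of the study.
    X_train, X_test = train_test_split(X, test_size=0.5, random_state=0)
    preprocess, km, k, score = fit_phenotype_model(X_train)

    # Internal validation step: assign held-out cases to the learned clusters.
    test_labels = km.predict(preprocess.transform(X_test))
    print("chosen k:", k, "mean silhouette:", round(score, 3))
    print("test cluster sizes:", np.bincount(test_labels, minlength=k))
```

Fitting the preprocessing stages on the training half only, and then reusing them unchanged on the test half, keeps the held-out cases from influencing the scaling, imputation or kernel projection.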
Results: We enrolled 1548 patients, each characterized by 92 clinical attributes (d = 109 features after variable encoding). Our clustering algorithm identified k = 3 distinct phenotypes and 18 strongly informative variables: phenotype A (788 cases [50.9% prevalence]: age ∼ 57, Charlson comorbidity ∼ 1, pneumonia CURB-65 score ∼ 0 to 1, respiratory rate at admission ∼ 18 min⁻¹, FiO₂ ∼ 21%, C-reactive protein CRP ∼ 49.5 mg/dL [median within cluster]); phenotype B (620 cases [40.0%]: age ∼ 75, Charlson ∼ 5, CURB-65 ∼ 1 to 2, respiratory rate ∼ 20 min⁻¹, FiO₂ ∼ 21%, CRP ∼ 101.5 mg/dL); and phenotype C (140 cases [9.0%]: age ∼ 71, Charlson ∼ 4, CURB-65 ∼ 0 to 2, respiratory rate ∼ 30 min⁻¹, FiO₂ ∼ 38%, CRP ∼ 152.3 mg/dL). Hypothesis testing provided solid statistical evidence of an association between phenotype and each clinical outcome (severity and mortality). The corresponding odds ratios showed a clear trend: unfavourable evolution was more frequent in phenotype C than in B, and more frequent in phenotype B than in A.
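For readers unfamiliar with the odds ratios mentioned in the Results, the following sketch shows how one is computed from a 2x2 table of outcome counts, together with a Wald 95% confidence interval. The abstract does not report the underlying outcome counts or the exact test used, so the function name `odds_ratio` and the counts in the usage lines are purely hypothetical placeholders, not study data.

```python
import math


def odds_ratio(events_a, total_a, events_b, total_b, z=1.96):
    """Odds ratio of group A versus group B, with a Wald confidence interval.

    Hypothetical helper for illustration; z = 1.96 gives an approximate
    95% interval.
    """
    a, b = events_a, total_a - events_a   # group A: with / without the outcome
    c, d = events_b, total_b - events_b   # group B: with / without the outcome
    if min(a, b, c, d) <= 0:
        raise ValueError("every cell of the 2x2 table must be positive")

    or_value = (a * d) / (b * c)
    # Standard error of log(OR) for the Wald interval.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_value) - z * se_log)
    upper = math.exp(math.log(or_value) + z * se_log)
    return or_value, lower, upper


if __name__ == "__main__":
    # Hypothetical counts, NOT taken from the study: 30 of 100 patients with
    # the outcome in one group versus 10 of 100 in the other.
    or_value, lower, upper = odds_ratio(30, 100, 10, 100)
    print(f"OR = {or_value:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

An odds ratio above 1 whose interval excludes 1 indicates that the outcome is more frequent in the first group, which is the kind of ordering the Results describe for phenotype C versus B and for B versus A.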
Conclusion: A compound unsupervised clustering technique (including fully automated optimization of its internal parameters) revealed three distinct groups of patients, or phenotypes. In turn, these were strongly associated with the clinical severity of pneumonia progression and with mortality.