Addressing statistical challenges in the analysis of proteomics data with extremely small sample size: a simulation study
Kyung Hyun Lee, Shervin Assassi, Chandra Mohan, Claudia Pedroza
BMC Genomics 25(1):1086. Published 2024-11-14.
DOI: https://doi.org/10.1186/s12864-024-11018-2
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11566501/pdf/
Citation count: 0
Abstract
Background: One of the most promising approaches to earlier and more precise disease prediction and diagnosis is the use of proteomics data augmented with clinical data. Clinical proteomics data are often characterized by high dimensionality and an extremely limited sample size, which poses a significant challenge when machine learning techniques are used to extract only the most relevant information. Although a wide array of statistical techniques and numerous analysis pipelines are employed in proteomics data analysis, it is unclear which of these methods produce the most efficient, reproducible, and clinically meaningful results.
Results: In this study, we compared 9 unique analysis schemes composed of different machine learning and dimensionality reduction methods for the analysis of simulated proteomics data consisting of 1317 proteins measured in 26 subjects (i.e., 13 controls and 13 cases). In scenarios where the sample size is extremely small (i.e., n < 30), all schemes yielded exceptionally high performance metrics, indicating potential overfitting. While performance metrics did not differ significantly across schemes, the sets of proteins selected as discriminatory between groups were substantially heterogeneous. However, despite the heterogeneity in the selected proteins, their associated biological pathways and genetic diseases were similar. A sensitivity analysis using varying sample sizes indicated that, within a scheme, the stability of the set of selected biomarkers improves with larger sample sizes.
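The relationship between sample size and the stability of a selected biomarker set can be illustrated with a small simulation. The sketch below is not one of the study's nine schemes: it assumes a simple selection rule (top-k proteins by absolute Welch t-statistic), Gaussian data, and hypothetical values for the number of proteins, the number of truly differential proteins, and the effect size. Stability is measured here as the mean pairwise Jaccard index between the sets selected on independently simulated datasets, which is one possible definition among several.

```python
import random

N_PROTEINS = 200   # assumption: smaller than the study's 1317, to keep the run fast
N_SIGNAL = 10      # assumption: proteins that truly differ between groups
EFFECT = 1.0       # assumption: mean shift in cases, in standard deviations
TOP_K = 10         # assumption: number of proteins each run "selects"


def simulate(n_per_group, rng):
    """Simulate controls and cases as lists of per-subject protein vectors."""
    controls = [[rng.gauss(0.0, 1.0) for _ in range(N_PROTEINS)]
                for _ in range(n_per_group)]
    cases = [[rng.gauss(EFFECT if j < N_SIGNAL else 0.0, 1.0)
              for j in range(N_PROTEINS)]
             for _ in range(n_per_group)]
    return controls, cases


def welch_t(a, b):
    """Welch t-statistic for two samples given as lists of floats."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    return (mean_b - mean_a) / (var_a / len(a) + var_b / len(b)) ** 0.5


def select_top_k(controls, cases, k=TOP_K):
    """Select the k proteins with the largest absolute t-statistic."""
    scores = []
    for j in range(N_PROTEINS):
        t = welch_t([s[j] for s in controls], [s[j] for s in cases])
        scores.append((abs(t), j))
    scores.sort(reverse=True)
    return {j for _, j in scores[:k]}


def jaccard(a, b):
    """Jaccard index of two non-empty sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(n_per_group, n_reps=10, seed=0):
    """Average Jaccard index over all pairs of independently simulated runs."""
    rng = random.Random(seed)
    selected = [select_top_k(*simulate(n_per_group, rng)) for _ in range(n_reps)]
    pairs = [(i, j) for i in range(n_reps) for j in range(i + 1, n_reps)]
    return sum(jaccard(selected[i], selected[j]) for i, j in pairs) / len(pairs)


if __name__ == "__main__":
    for n in (13, 50, 100):
        stability = mean_pairwise_jaccard(n)
        print(f"n per group = {n:3d}  mean pairwise Jaccard = {stability:.2f}")
```

Under these assumptions, the selected sets overlap only partially at 13 subjects per group and converge as the group size grows, which is the qualitative pattern the sensitivity analysis describes. The exact values depend on the assumed effect size and selection rule.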
Conclusions: When the aim of the study is to identify a statistical model that best distinguishes between cohort groups using proteomics data and to uncover the biological pathways and disorders common among the selected proteins, the majority of widely used analysis pipelines perform similarly. However, if the main objective is to pinpoint a set of selected proteins that wield significant influence in discriminating cohort groups and utilize them for subsequent investigations, meticulous consideration is necessary when opting for statistical models, due to the possibility of heterogeneity in the sets of selected proteins.
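To make the point about heterogeneous protein sets concrete, the sketch below applies two different ranking rules to the same simulated dataset and reports how much the two selected sets overlap. It is only an illustration under stated assumptions: the dimensions (1317 proteins, 13 controls, 13 cases) come from the abstract, but the Gaussian data model, the number of truly differential proteins, the effect size, the value of k, and both ranking rules (absolute Welch t-statistic and a rank-based |AUC - 0.5| score) are hypothetical choices, not the schemes evaluated in the study.

```python
import random

N_PROTEINS = 1317  # from the abstract
N_PER_GROUP = 13   # from the abstract: 13 controls and 13 cases
N_SIGNAL = 20      # assumption: proteins that truly differ between groups
EFFECT = 1.0       # assumption: mean shift in cases, in standard deviations
TOP_K = 20         # assumption: number of proteins each rule selects


def simulate(rng):
    """Simulate controls and cases as lists of per-subject protein vectors."""
    controls = [[rng.gauss(0.0, 1.0) for _ in range(N_PROTEINS)]
                for _ in range(N_PER_GROUP)]
    cases = [[rng.gauss(EFFECT if j < N_SIGNAL else 0.0, 1.0)
              for j in range(N_PROTEINS)]
             for _ in range(N_PER_GROUP)]
    return controls, cases


def t_score(a, b):
    """Absolute Welch t-statistic between control values a and case values b."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    return abs(mean_b - mean_a) / (var_a / len(a) + var_b / len(b)) ** 0.5


def auc_score(a, b):
    """|AUC - 0.5|, with AUC the share of (control, case) pairs where case > control."""
    wins = sum(1 for x in a for y in b if y > x)
    return abs(wins / (len(a) * len(b)) - 0.5)


def select(controls, cases, score_fn, k=TOP_K):
    """Select the k proteins with the highest score (ties broken by protein index)."""
    scores = []
    for j in range(N_PROTEINS):
        a = [s[j] for s in controls]
        b = [s[j] for s in cases]
        scores.append((score_fn(a, b), j))
    scores.sort(reverse=True)
    return {j for _, j in scores[:k]}


def jaccard(a, b):
    """Jaccard index of two non-empty sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    rng = random.Random(0)
    controls, cases = simulate(rng)
    by_t = select(controls, cases, t_score)
    by_auc = select(controls, cases, auc_score)
    print(f"shared proteins: {len(by_t & by_auc)} of {TOP_K}")
    print(f"Jaccard overlap: {jaccard(by_t, by_auc):.2f}")
```

Even two closely related univariate rules need not agree on the selected set at this sample size. If the selected proteins are to be carried into follow-up experiments, an overlap check of this kind is one simple way to see how sensitive the list is to the choice of method.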
About the journal:
BMC Genomics is an open access, peer-reviewed journal that considers articles on all aspects of genome-scale analysis, functional genomics, and proteomics.
BMC Genomics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.