Guannan Yang, E. Menkhorst, E. Dimitriadis, K. Lê Cao
Title: PLSKO: a robust knockoff generator to control false discovery rate in omics variable selection
Journal: bioRxiv
Publication date: 2024-08-08
DOI: 10.1101/2024.08.06.606935 (https://doi.org/10.1101/2024.08.06.606935)
Abstract
The knockoff framework, combined with a variable selection procedure, controls the false discovery rate (FDR) without the need to calculate p-values. Hence, it presents an attractive alternative to differential expression analysis of high-throughput biological data. However, current knockoff variable generators make strong assumptions or insufficient approximations that lead to FDR inflation when applied to biological data. We propose Partial Least Squares Knockoff (PLSKO), an efficient and assumption-free knockoff generator that is robust to varying types of biological omics data. We compare PLSKO with a wide range of existing methods. In simulation studies, we show that PLSKO is the only method that controls FDR with sufficient statistical power in complex non-linear cases. In semi-simulation studies based on real data, we show that PLSKO generates valid knockoff variables for different types of biological data, including RNA-seq, proteomics, metabolomics and microbiome data. In preeclampsia multi-omics case studies, we combine PLSKO with Aggregation Knockoff to address the randomness of knockoffs and improve power, and show that our method is able to select variables that are biologically relevant.
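To illustrate how a knockoff procedure can control FDR without p-values, the following is a minimal sketch of the selection step of a generic knockoff filter (the "knockoff+" threshold of Barber and Candès). It is not the PLSKO generator and is not taken from the paper. The function name `knockoff_plus_select` and the input `W` are assumptions made for illustration: `W[j]` is taken to be a feature statistic that is large and positive when variable j looks more important than its knockoff copy, and whose sign is equally likely to be positive or negative for null variables.

```python
import numpy as np

def knockoff_plus_select(W, q=0.1):
    """Return indices selected by the knockoff+ threshold at target FDR q.

    W is a hypothetical vector of knockoff feature statistics
    (for example, the difference in absolute coefficients between each
    original variable and its knockoff copy).
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds: the distinct nonzero magnitudes of W, ascending.
    candidates = np.unique(np.abs(W[W != 0]))
    for t in candidates:
        # Negative statistics beyond -t estimate the number of false positives;
        # the +1 is the conservative correction of the knockoff+ variant.
        false_est = 1 + np.sum(W <= -t)
        n_selected = max(1, np.sum(W >= t))
        if false_est / n_selected <= q:
            # Smallest threshold meeting the target: select everything above it.
            return np.flatnonzero(W >= t)
    # No threshold achieves the target FDR: select nothing.
    return np.array([], dtype=int)

# Toy example with made-up statistics (not data from the paper).
W_example = [10, 9, 8, 7, 6, 5, -1, 4, 3, 2]
selected = knockoff_plus_select(W_example, q=0.2)
print(selected)
```

In this sketch the quality of the knockoff variables matters only through `W`: if the generator produces invalid knockoffs, the sign symmetry assumed for null variables fails and the estimated false discovery proportion can be too optimistic, which is the FDR inflation the abstract describes.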