Thomas van Niekerk, Jacques V. Venter, Stephanus E. Terblanche
{"title":"An automated exact solution framework towards solving the logistic regression best subset selection problem","authors":"Thomas van Niekerk, Jacques V. Venter, Stephanus E. Terblanche","doi":"10.37920/sasj.2023.57.2.2","DOIUrl":null,"url":null,"abstract":"An automated logistic regression solution framework (ALRSF) is proposed to solve a mixed integer programming (MIP) formulation of the well known logistic regression best subset selection problem. The solution framework firstly determines the optimal number of independent variables that should be included in the model using an automated cardinality parameter selection procedure. The cardinality parameter dictates the size of the subset of variables and can be problem-specific. A novel regression parameter fixing heuristic that utilises a Benders decomposition algorithm is applied to prune the solution search space such that the optimal regression parameter values are found faster. An optimality gap is subsequently calculated to quantify the quality of the final regression model by considering the distance between the best possible log-likelihood value and a log-likelihood value that is calculated using the current parameter values. Attempts are then made to reduce the optimality gap by adjusting regression parameter values. The ALRSF serves as a holistic variable selection framework that enables the user to consider larger datasets when solving the best subset selection logistic regression problem by significantly reducing the memory requirements associated with its mixed integer programming formulation. Furthermore, the automated framework requires minimal user intervention during model training and hyperparameter tuning. Improvements in quality of the final model (when considering both the optimality gap and computing resources required to achieve a result) are observed when the ALRSF is applied to well-known real-world UCI machine learning datasets. Keywords: Best subset selection, Independent variable selection, Logistic regression, Mixed integer programming","PeriodicalId":53997,"journal":{"name":"SOUTH AFRICAN STATISTICAL JOURNAL","volume":"1 1","pages":"0"},"PeriodicalIF":0.4000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"SOUTH AFRICAN STATISTICAL JOURNAL","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.37920/sasj.2023.57.2.2","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"STATISTICS & PROBABILITY","Score":null,"Total":0}
引用次数: 0
Abstract
An automated logistic regression solution framework (ALRSF) is proposed to solve a mixed integer programming (MIP) formulation of the well known logistic regression best subset selection problem. The solution framework firstly determines the optimal number of independent variables that should be included in the model using an automated cardinality parameter selection procedure. The cardinality parameter dictates the size of the subset of variables and can be problem-specific. A novel regression parameter fixing heuristic that utilises a Benders decomposition algorithm is applied to prune the solution search space such that the optimal regression parameter values are found faster. An optimality gap is subsequently calculated to quantify the quality of the final regression model by considering the distance between the best possible log-likelihood value and a log-likelihood value that is calculated using the current parameter values. Attempts are then made to reduce the optimality gap by adjusting regression parameter values. The ALRSF serves as a holistic variable selection framework that enables the user to consider larger datasets when solving the best subset selection logistic regression problem by significantly reducing the memory requirements associated with its mixed integer programming formulation. Furthermore, the automated framework requires minimal user intervention during model training and hyperparameter tuning. Improvements in quality of the final model (when considering both the optimality gap and computing resources required to achieve a result) are observed when the ALRSF is applied to well-known real-world UCI machine learning datasets. Keywords: Best subset selection, Independent variable selection, Logistic regression, Mixed integer programming
期刊介绍:
The journal will publish innovative contributions to the theory and application of statistics. Authoritative review articles on topics of general interest which are not readily accessible in a coherent form, will be also be considered for publication. Articles on applications or of a general nature will be published in separate sections and an author should indicate which of these sections an article is intended for. An applications article should normally consist of the analysis of actual data and need not necessarily contain new theory. The data should be made available with the article but need not necessarily be part of it.