Pub Date: 2023-01-01 DOI: 10.37920/sasj.2023.57.2.2
Thomas van Niekerk, Jacques V. Venter, Stephanus E. Terblanche
An automated logistic regression solution framework (ALRSF) is proposed to solve a mixed integer programming (MIP) formulation of the well-known logistic regression best subset selection problem. The solution framework first determines the optimal number of independent variables that should be included in the model using an automated cardinality parameter selection procedure. The cardinality parameter dictates the size of the subset of variables and can be problem-specific. A novel regression parameter fixing heuristic that utilises a Benders decomposition algorithm is applied to prune the solution search space such that the optimal regression parameter values are found faster. An optimality gap is subsequently calculated to quantify the quality of the final regression model by considering the distance between the best possible log-likelihood value and a log-likelihood value that is calculated using the current parameter values. Attempts are then made to reduce the optimality gap by adjusting regression parameter values. The ALRSF serves as a holistic variable selection framework that enables the user to consider larger datasets when solving the best subset selection logistic regression problem by significantly reducing the memory requirements associated with its mixed integer programming formulation. Furthermore, the automated framework requires minimal user intervention during model training and hyperparameter tuning. Improvements in the quality of the final model (when considering both the optimality gap and the computing resources required to achieve a result) are observed when the ALRSF is applied to well-known real-world UCI machine learning datasets.
Keywords: Best subset selection, Independent variable selection, Logistic regression, Mixed integer programming
{"title":"An automated exact solution framework towards solving the logistic regression best subset selection problem","authors":"Thomas van Niekerk, Jacques V. Venter, Stephanus E. Terblanche","doi":"10.37920/sasj.2023.57.2.2","DOIUrl":"https://doi.org/10.37920/sasj.2023.57.2.2","url":null,"abstract":"An automated logistic regression solution framework (ALRSF) is proposed to solve a mixed integer programming (MIP) formulation of the well known logistic regression best subset selection problem. The solution framework firstly determines the optimal number of independent variables that should be included in the model using an automated cardinality parameter selection procedure. The cardinality parameter dictates the size of the subset of variables and can be problem-specific. A novel regression parameter fixing heuristic that utilises a Benders decomposition algorithm is applied to prune the solution search space such that the optimal regression parameter values are found faster. An optimality gap is subsequently calculated to quantify the quality of the final regression model by considering the distance between the best possible log-likelihood value and a log-likelihood value that is calculated using the current parameter values. Attempts are then made to reduce the optimality gap by adjusting regression parameter values. The ALRSF serves as a holistic variable selection framework that enables the user to consider larger datasets when solving the best subset selection logistic regression problem by significantly reducing the memory requirements associated with its mixed integer programming formulation. Furthermore, the automated framework requires minimal user intervention during model training and hyperparameter tuning. Improvements in quality of the final model (when considering both the optimality gap and computing resources required to achieve a result) are observed when the ALRSF is applied to well-known real-world UCI machine learning datasets. 
Keywords: Best subset selection, Independent variable selection, Logistic regression, Mixed integer programming","PeriodicalId":53997,"journal":{"name":"SOUTH AFRICAN STATISTICAL JOURNAL","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136258806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
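The optimality gap described in the abstract above compares a bound on the best attainable log-likelihood with the log-likelihood at the current parameter values. The following is a minimal sketch of that comparison, not the authors' implementation: the function names `log_likelihood` and `relative_gap`, the convention that `beta[0]` is the intercept, and the choice of a relative (rather than absolute) gap are all assumptions made for illustration.

```python
import math


def log_likelihood(X, y, beta):
    """Bernoulli log-likelihood of a logistic model.

    Assumption: beta[0] is the intercept and beta[1:] lines up with the
    columns of X. Coefficients of excluded variables are simply 0.
    """
    total = 0.0
    for row, label in zip(X, y):
        eta = beta[0] + sum(b * x for b, x in zip(beta[1:], row))
        # log(1 + exp(eta)), written so that large |eta| does not overflow
        log1pexp = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
        total += label * eta - log1pexp
    return total


def relative_gap(best_bound, incumbent):
    """Relative distance between a bound on the best possible
    log-likelihood and the log-likelihood of the current parameters."""
    return abs(best_bound - incumbent) / max(abs(incumbent), 1e-12)
```

With a hypothetical solver bound of -10.0 and a current log-likelihood of -12.5, `relative_gap(-10.0, -12.5)` gives 0.2, that is, the current model is within 20% of the bound.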
In some standard applications of spatial point pattern analysis, window selection for spatial point pattern data is complex. Often, the point pattern window is given a priori. Otherwise, the region is chosen using some objective means reflecting a view that the window may be representative of a larger region. The typical approaches used are the smallest rectangular bounding window and convex windows. The chosen window should, however, cover the true domain of the point process since it defines the domain for point pattern analysis and supports estimation and inference. Choosing too large a window results in spurious estimation and inference in regions of the window where points cannot occur. We propose a new algorithm for selecting the point pattern domain based on spatial covariate information and without the restriction of convexity, allowing for better estimation of the true domain. A modified kernel smoothed intensity estimate that uses the Euclidean shortest path distance is proposed as validation of the algorithm. The proposed algorithm is applied in the setting of rural villages in Tanzania. As a spatial covariate, remotely sensed elevation data is used. The algorithm is able to detect and filter out high relief areas and steep slopes; observed characteristics that make the occurrence of a household in these regions improbable.
Keywords: Covariate, Euclidean shortest path, Nonconvex, Spatial point pattern, Window selection
{"title":"Covariate construction of nonconvex windows for spatial point patterns","authors":"Kabelo Mahloromela, Inger Fabris-Rotelli, Christine Kraamwinkel","doi":"10.37920/sasj.2023.57.2.1","DOIUrl":"https://doi.org/10.37920/sasj.2023.57.2.1","url":null,"abstract":"In some standard applications of spatial point pattern analysis, window selection for spatial point pattern data is complex. Often, the point pattern window is given a priori. Otherwise, the region is chosen using some objective means reflecting a view that the window may be representative of a larger region. The typical approaches used are the smallest rectangular bounding window and convex windows. The chosen window should however cover the true domain of the point process since it defines the domain for point pattern analysis and supports estimation and inference. Choosing too large a window results in spurious estimation and inference in regions of the window where points cannot occur. We propose a new algorithm for selecting the point pattern domain based on spatial covariate information and without the restriction of convexity, allowing for better estimation of the true domain. A modified kernel smoothed intensity estimate that uses the Euclidean shortest path distance is proposed as validation of the algorithm. The proposed algorithm is applied in the setting of rural villages in Tanzania. As a spatial covariate, remotely sensed elevation data is used. The algorithm is able to detect and filter out high relief areas and steep slopes; observed characteristics that make the occurrence of a household in these regions improbable. 
Keywords: Covariate, Euclidean shortest path, Nonconvex, Spatial point pattern, Window selection","PeriodicalId":53997,"journal":{"name":"SOUTH AFRICAN STATISTICAL JOURNAL","volume":"389 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136258807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-09-01 DOI: 10.37920/sasj.2021.55.2.2
Lars Palapies
{"title":"On the variance and skewness of the swap rate in a stochastic volatility interest rate model","authors":"Lars Palapies","doi":"10.37920/sasj.2021.55.2.2","DOIUrl":"https://doi.org/10.37920/sasj.2021.55.2.2","url":null,"abstract":"","PeriodicalId":53997,"journal":{"name":"SOUTH AFRICAN STATISTICAL JOURNAL","volume":" ","pages":""},"PeriodicalIF":0.3,"publicationDate":"2021-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48288689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-09-01 DOI: 10.37920/sasj.2021.55.2.1
M. S. Chowdhury, Bogdan Gadidov, Linh Le, Yan Wang, L. Vanbrackle
{"title":"Time-variant nonparametric extreme quantile estimation with application to US temperature data","authors":"M. S. Chowdhury, Bogdan Gadidov, Linh Le, Yan Wang, L. Vanbrackle","doi":"10.37920/sasj.2021.55.2.1","DOIUrl":"https://doi.org/10.37920/sasj.2021.55.2.1","url":null,"abstract":"","PeriodicalId":53997,"journal":{"name":"SOUTH AFRICAN STATISTICAL JOURNAL","volume":"1 1","pages":""},"PeriodicalIF":0.3,"publicationDate":"2021-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43303152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-09-01 DOI: 10.37920/sasj.2021.55.2.3
E. Slabber, T. Verster, Riaan de Jongh
{"title":"Advantages of using factorisation machines as a statistical modelling technique","authors":"E. Slabber, T. Verster, Riaan de Jongh","doi":"10.37920/sasj.2021.55.2.3","DOIUrl":"https://doi.org/10.37920/sasj.2021.55.2.3","url":null,"abstract":"","PeriodicalId":53997,"journal":{"name":"SOUTH AFRICAN STATISTICAL JOURNAL","volume":" ","pages":""},"PeriodicalIF":0.3,"publicationDate":"2021-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49105226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-03-31 DOI: 10.37920/SASJ.2021.55.1.4
Bhargab Chattopadhyay, Swarnali Banerjee
This paper develops a general approach for constructing a confidence interval for a parameter of interest with a specified confidence coefficient and a specified width. This is done assuming a known positive lower bound for the unknown nuisance parameter and independence of suitable statistics. Under mild conditions, we develop a modified two-stage procedure which enjoys attractive optimality properties, including a second-order efficiency property and an asymptotic consistency property. We extend this work to finding a confidence interval for the location parameter of the inverse Gaussian distribution. As an illustration, we develop a modified mean absolute deviation-based procedure in the supplementary section for finding a fixed-width confidence interval for the normal mean.
{"title":"Estimation of location parameter within pre-specified error bound with second-order efficient two-stage procedure","authors":"Bhargab Chattopadhyay, Swarnali Banerjee","doi":"10.37920/SASJ.2021.55.1.4","DOIUrl":"https://doi.org/10.37920/SASJ.2021.55.1.4","url":null,"abstract":"This paper develops a general approach for constructing a confidence interval for a parameter of interest with a specified confidence coefficient and a specified width. This is done assuming known a positive lower bound for the unknown nuisance parameter and independence of suitable statistics. Under mild conditions, we develop a modified two-stage procedure which enjoys attractive optimality properties including a second-order efficiency property and asymptotic consistency property. We extend this work for finding a confidence interval for the location parameter of the inverse Gaussian distribution. As an illustration, we developed a modified mean absolute deviation-based procedure in the supplementary section for finding a fixed-width confidence interval for the normal mean.","PeriodicalId":53997,"journal":{"name":"SOUTH AFRICAN STATISTICAL JOURNAL","volume":"55 1","pages":"45-54"},"PeriodicalIF":0.3,"publicationDate":"2021-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43994375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
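To make the idea of a two-stage procedure with a known lower bound concrete, here is a simplified sketch for the normal mean. It is not the procedure from the paper: it assumes the nuisance parameter is the standard deviation with known lower bound `sigma_lower`, takes `d` to be the half-width of the interval, and uses the standard normal quantile in both stages as an approximation (classical two-stage rules use a Student t quantile in the second stage). The function names and the floor `m0` are hypothetical.

```python
import math
from statistics import NormalDist, variance


def pilot_size(d, alpha, sigma_lower, m0=2):
    """Stage 1: pilot sample size driven by the known lower bound on sigma."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return max(m0, math.floor((z * sigma_lower / d) ** 2) + 1)


def final_size(pilot, d, alpha):
    """Stage 2: total sample size based on the pilot sample variance.

    Approximation: the normal quantile is used here in place of a
    Student t quantile.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    s2 = variance(pilot)  # unbiased sample variance of the pilot data
    return max(len(pilot), math.floor(z * z * s2 / (d * d)) + 1)
```

In use, one would draw `pilot_size(...)` observations, call `final_size(...)` on them, draw the remaining observations, and report the sample mean plus or minus `d`. The lower bound matters because it keeps the pilot sample from being too small when `d` is small.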
Pub Date: 2021-03-31 DOI: 10.37920/SASJ.2021.55.1.6
James Thomson, Harsha Perera, T. Swartz
Cricket is a sport for which many batting and bowling statistics have been proposed. However, a feature of cricket is that the level of aggressiveness adopted by batsmen is dependent on match circumstances. It is therefore relevant to consider these circumstances when evaluating batting and bowling performances. This paper considers batting performance in the second innings of limited overs cricket when a target has been set. The runs required, the number of overs completed and the wickets taken are relevant in assessing the batting performance. We produce a visualization for second innings batting which describes how a batsman performs under different circumstances. The visualization is then reduced to a single statistic "clutch batting" which can be used to compare batsmen. An analogous approach is then provided for bowlers based on the symmetry between batting and bowling, and we define the statistic "clutch bowling".
{"title":"Contextual batting and bowling in limited overs cricket","authors":"James Thomson, Harsha Perera, T. Swartz","doi":"10.37920/SASJ.2021.55.1.6","DOIUrl":"https://doi.org/10.37920/SASJ.2021.55.1.6","url":null,"abstract":"Cricket is a sport for which many batting and bowling statistics have been proposed. However, a feature of cricket is that the level of aggressiveness adopted by batsmen is dependent on match circumstances. It is therefore relevant to consider these circumstances when evaluating batting and bowling performances. This paper considers batting performance in the second innings of limited overs cricket when a target has been set. The runs required, the number of overs completed and the wickets taken are relevant in assessing the batting performance. We produce a visualization for second innings batting which describes how a batsman performs under different circumstances. The visualization is then reduced to a single statistic \"clutch batting\" which can be used to compare batsmen. An analogous approach is then provided for bowlers based on the symmetry between batting and bowling, and we define the statistic \"clutch bowling\".","PeriodicalId":53997,"journal":{"name":"SOUTH AFRICAN STATISTICAL JOURNAL","volume":" ","pages":""},"PeriodicalIF":0.3,"publicationDate":"2021-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44199427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-03-31 DOI: 10.37920/SASJ.2021.55.1.5
R. Kirsten, I. Fabris-Rotelli
Two spatial data sets are considered to be similar if they originate from the same stochastic process in terms of their spatial structure. Many tests have been developed over recent years to test the similarity of certain types of spatial data, such as spatial point patterns, geostatistical data and images. This research proposes a generic spatial similarity test able to handle various types of spatial data, for example images (modelled spatially), point patterns, marked point patterns, geostatistical data and lattice patterns. A simulation study is done in order to test the method for each spatial data set. After the simulation study, it was concluded that the proposed spatial similarity test is not sensitive to the user-defined resolution of the pixel image representation. From the simulation study, the proposed spatial similarity test performs well on lattice data, some of the unmarked point patterns and the marked point patterns with discrete marks. We illustrate this test on property prices in the City of Cape Town and the City of Johannesburg, South Africa.
{"title":"A generic test for the similarity of spatial data","authors":"R. Kirsten, I. Fabris-Rotelli","doi":"10.37920/SASJ.2021.55.1.5","DOIUrl":"https://doi.org/10.37920/SASJ.2021.55.1.5","url":null,"abstract":"Two spatial data sets are considered to be similar if they originate from the same stochastic process in terms of their spatial structure. Many tests have been developed over recent years to test the similarity of certain types of spatial data, such as spatial point patterns, geostatistical data and images. This research proposes a generic spatial similarity test able to handle various types of spatial data, for example images (modelled spatially), point patterns, marked point patterns, geostatistical data and lattice patterns. A simulation study is done in order to test the method for each spatial data set. After the simulation study, it was concluded that the proposed spatial similarity test is not sensitive to the user-defined resolution of the pixel image representation. From the simulation study, the proposed spatial similarity test performs well on lattice data, some of the unmarked point patterns and the marked point patterns with discrete marks. We illustrate this test on property prices in the City of Cape Town and the City of Johannesburg, South Africa.","PeriodicalId":53997,"journal":{"name":"SOUTH AFRICAN STATISTICAL JOURNAL","volume":"55 1","pages":"55-71"},"PeriodicalIF":0.3,"publicationDate":"2021-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43105275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-03-31 DOI: 10.37920/SASJ.2021.55.1.3
A. Bere, Godfrey H. Sithuba, Coster Mabvuu, Retang Mashabela, C. Sigauke, K. Kyei
We present the results of a simulation study performed to compare the accuracy of a lasso-type penalization method and gradient boosting in estimating the baseline hazard function and covariate parameters in discrete survival models. The mean square error results reveal that the lasso-type algorithm performs better in recovering the baseline hazard and covariate parameters. In particular, gradient boosting underestimates the sizes of the parameters and also has a high false positive rate. Similar results are obtained in an application to real-life data.
{"title":"Regularisation in discrete survival models: A comparison of lasso and gradient boosting","authors":"A. Bere, Godfrey H. Sithuba, Coster Mabvuu, Retang Mashabela, C. Sigauke, K. Kyei","doi":"10.37920/SASJ.2021.55.1.3","DOIUrl":"https://doi.org/10.37920/SASJ.2021.55.1.3","url":null,"abstract":"We present the results of a simulation study performed to compare the accuracy of a lasso-type penalization method and gradient boosting in estimating the baseline hazard function and covariate parameters in discrete survival models. The mean square error results reveal that the lasso-type algorithm performs better in recovering the baseline hazard and covariate parameters. In particular, gradient boosting underestimates the sizes of the parameters and also has a high false positive rate. Similar results are obtained in an application to real-life data.","PeriodicalId":53997,"journal":{"name":"SOUTH AFRICAN STATISTICAL JOURNAL","volume":"55 1","pages":"29-44"},"PeriodicalIF":0.3,"publicationDate":"2021-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44744638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-03-01 DOI: 10.37920/SASJ.2021.55.1.2
Amina Bari, A. Rassoul, Hamid Ould Rouis
In the present paper, we define and study one of the most popular indices for measuring the inequality of capital incomes, known as the Gini index. We construct a semiparametric estimator for the Gini index in the case of heavy-tailed income distributions, establish its asymptotic distribution and derive confidence bounds. We explore the performance of the confidence bounds in a simulation study and draw conclusions about capital incomes in some income distributions.
{"title":"Estimating the Gini index for heavy-tailed income distributions","authors":"Amina Bari, A. Rassoul, Hamid Ould Rouis","doi":"10.37920/SASJ.2021.55.1.2","DOIUrl":"https://doi.org/10.37920/SASJ.2021.55.1.2","url":null,"abstract":"In the present paper, we define and study one of the most popular indices which measures the inequality of capital incomes, known as the Gini index. We construct a semiparametric estimator for the Gini index in case of heavy-tailed income distributions and we establish its asymptotic distribution and derive bounds of confidence. We explore the performance of the confidence bounds in a simulation study and draw conclusions about capital incomes in some income distributions.","PeriodicalId":53997,"journal":{"name":"SOUTH AFRICAN STATISTICAL JOURNAL","volume":" ","pages":""},"PeriodicalIF":0.3,"publicationDate":"2021-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49304065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
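For readers unfamiliar with the Gini index, the sketch below computes the ordinary empirical (plug-in) version from a sample of incomes. This is a baseline for orientation only and is not the semiparametric heavy-tail estimator constructed in the paper. The function name `gini` is an assumption, and the code assumes non-negative incomes.

```python
def gini(incomes):
    """Empirical Gini index of a sample of non-negative incomes.

    Uses the order-statistic form
        G = 2 * sum(i * x_(i)) / (n * sum(x)) - (n + 1) / n,
    where x_(1) <= ... <= x_(n) are the sorted incomes.
    """
    xs = sorted(incomes)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total <= 0:
        raise ValueError("need a non-empty sample with positive total income")
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n
```

A perfectly equal sample such as `[1, 1, 1, 1]` gives 0, while `[0, 0, 0, 1]` gives 0.75, the largest value attainable with four observations. With heavy-tailed data a few extreme incomes dominate the sums in this formula, which is the usual motivation for tail-adjusted estimators.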