SOUTH AFRICAN STATISTICAL JOURNAL: Latest Articles

An automated exact solution framework towards solving the logistic regression best subset selection problem

Q4 STATISTICS & PROBABILITY

SOUTH AFRICAN STATISTICAL JOURNAL

Pub Date : 2023-01-01 DOI: 10.37920/sasj.2023.57.2.2

Thomas van Niekerk, Jacques V. Venter, Stephanus E. Terblanche

An automated logistic regression solution framework (ALRSF) is proposed to solve a mixed integer programming (MIP) formulation of the well-known logistic regression best subset selection problem. The framework first determines the optimal number of independent variables to include in the model using an automated cardinality parameter selection procedure. The cardinality parameter dictates the size of the subset of variables and can be problem-specific. A novel regression parameter fixing heuristic that utilises a Benders decomposition algorithm is then applied to prune the solution search space so that the optimal regression parameter values are found faster. An optimality gap is subsequently calculated to quantify the quality of the final regression model by considering the distance between the best possible log-likelihood value and the log-likelihood value calculated using the current parameter values. Attempts are then made to reduce the optimality gap by adjusting regression parameter values. The ALRSF serves as a holistic variable selection framework that enables the user to consider larger datasets when solving the best subset selection logistic regression problem by significantly reducing the memory requirements associated with its mixed integer programming formulation. Furthermore, the automated framework requires minimal user intervention during model training and hyperparameter tuning. Improvements in the quality of the final model (when considering both the optimality gap and the computing resources required to achieve a result) are observed when the ALRSF is applied to well-known real-world UCI machine learning datasets.

Keywords: Best subset selection, Independent variable selection, Logistic regression, Mixed integer programming
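To make the problem statement concrete, the following is a minimal sketch of best subset selection by exhaustive enumeration for a fixed cardinality k. It is not the ALRSF: the MIP formulation, the Benders decomposition and the parameter fixing heuristic described in the abstract are not reproduced. The function names (`fit_logistic`, `best_subset`, `make_demo_data`), the plain gradient-ascent fit and the synthetic data are all illustrative assumptions.

```python
import itertools
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def linear_predictor(beta, row):
    """beta[0] is the intercept; beta[1:] pairs with the columns of row."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))


def log_likelihood(X, y, beta):
    """Bernoulli log-likelihood of a logistic regression model."""
    total = 0.0
    for row, label in zip(X, y):
        z = linear_predictor(beta, row)
        # log(1 + exp(z)) evaluated without overflow
        total += label * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return total


def fit_logistic(X, y, steps=500, lr=0.5):
    """Gradient ascent on the mean log-likelihood (hypothetical, simple fit)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)
    for _ in range(steps):
        grad = [0.0] * (p + 1)
        for row, label in zip(X, y):
            resid = label - sigmoid(linear_predictor(beta, row))
            grad[0] += resid
            for j, x in enumerate(row):
                grad[j + 1] += resid * x
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def best_subset(X, y, k):
    """Try every subset of k columns; keep the one with the highest log-likelihood."""
    best = None
    for subset in itertools.combinations(range(len(X[0])), k):
        Xs = [[row[j] for j in subset] for row in X]
        beta = fit_logistic(Xs, y)
        ll = log_likelihood(Xs, y, beta)
        if best is None or ll > best[0]:
            best = (ll, subset, beta)
    return best


def make_demo_data(n=200, seed=0):
    """Synthetic data: only column 0 drives the response; columns 1 and 2 are noise."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(n)]
    y = [1 if rng.random() < sigmoid(2.0 * row[0]) else 0 for row in X]
    return X, y


if __name__ == "__main__":
    X, y = make_demo_data()
    ll, subset, beta = best_subset(X, y, k=1)
    print("selected columns:", subset, "log-likelihood:", round(ll, 3))
```

Enumeration fits one model per subset, so its cost grows with the binomial coefficient C(p, k). That growth is why exact approaches for larger problems rely on formulations such as the MIP described in the abstract.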

Citations: 0

Covariate construction of nonconvex windows for spatial point patterns

Q4 STATISTICS & PROBABILITY

SOUTH AFRICAN STATISTICAL JOURNAL

Pub Date : 2023-01-01 DOI: 10.37920/sasj.2023.57.2.1

Kabelo Mahloromela, Inger Fabris-Rotelli, Christine Kraamwinkel

In some standard applications of spatial point pattern analysis, window selection for spatial point pattern data is complex. Often, the point pattern window is given a priori. Otherwise, the region is chosen using some objective means reflecting a view that the window may be representative of a larger region. The typical approaches used are the smallest rectangular bounding window and convex windows. The chosen window should, however, cover the true domain of the point process, since it defines the domain for point pattern analysis and supports estimation and inference. Choosing too large a window results in spurious estimation and inference in regions of the window where points cannot occur. We propose a new algorithm for selecting the point pattern domain based on spatial covariate information and without the restriction of convexity, allowing for better estimation of the true domain. A modified kernel smoothed intensity estimate that uses the Euclidean shortest path distance is proposed as validation of the algorithm. The proposed algorithm is applied in the setting of rural villages in Tanzania. As a spatial covariate, remotely sensed elevation data is used. The algorithm is able to detect and filter out high relief areas and steep slopes, observed characteristics that make the occurrence of a household in these regions improbable.

Keywords: Covariate, Euclidean shortest path, Nonconvex, Spatial point pattern, Window selection

Citations: 0

On the variance and skewness of the swap rate in a stochastic volatility interest rate model

IF 0.3 Q4 STATISTICS & PROBABILITY

SOUTH AFRICAN STATISTICAL JOURNAL

Pub Date : 2021-09-01 DOI: 10.37920/sasj.2021.55.2.2

Lars Palapies

Citations: 0

Time-variant nonparametric extreme quantile estimation with application to US temperature data

IF 0.3 Q4 STATISTICS & PROBABILITY

SOUTH AFRICAN STATISTICAL JOURNAL

Pub Date : 2021-09-01 DOI: 10.37920/sasj.2021.55.2.1

M. S. Chowdhury, Bogdan Gadidov, Linh Le, Yan Wang, L. Vanbrackle

Citations: 0

Advantages of using factorisation machines as a statistical modelling technique

IF 0.3 Q4 STATISTICS & PROBABILITY

SOUTH AFRICAN STATISTICAL JOURNAL

Pub Date : 2021-09-01 DOI: 10.37920/sasj.2021.55.2.3

E. Slabber, T. Verster, Riaan de Jongh

Citations: 0

Estimation of location parameter within pre-specified error bound with second-order efficient two-stage procedure

IF 0.3 Q4 STATISTICS & PROBABILITY

SOUTH AFRICAN STATISTICAL JOURNAL

Pub Date : 2021-03-31 DOI: 10.37920/SASJ.2021.55.1.4

Bhargab Chattopadhyay, Swarnali Banerjee

This paper develops a general approach for constructing a confidence interval for a parameter of interest with a specified confidence coefficient and a specified width. The approach assumes a known positive lower bound for the unknown nuisance parameter and independence of suitable statistics. Under mild conditions, we develop a modified two-stage procedure that enjoys attractive optimality properties, including second-order efficiency and asymptotic consistency. We extend this work to find a confidence interval for the location parameter of the inverse Gaussian distribution. As an illustration, we develop a modified mean absolute deviation-based procedure in the supplementary section for finding a fixed-width confidence interval for the normal mean.
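The two-stage idea can be sketched for the simplest case named in the abstract: a fixed-width interval for a normal mean when a positive lower bound on the standard deviation is known. This is only a sketch under stated assumptions and is not the modified procedure of the paper. The function names are hypothetical, the interval is taken to be the sample mean plus or minus `half_width`, the lower bound is assumed to set the pilot sample size, and a standard normal quantile is swapped in where a classical Stein-type rule would use a Student-t quantile, so that only the Python standard library is needed.

```python
import math
from statistics import NormalDist, mean, variance


def normal_quantile(conf):
    """Two-sided standard normal quantile for confidence level conf."""
    return NormalDist().inv_cdf(0.5 + conf / 2.0)


def pilot_size(sigma_lower, half_width, conf=0.95, m0=2):
    """Stage 1: pilot sample size implied by the known lower bound on sigma."""
    z = normal_quantile(conf)
    return max(m0, math.floor((z * sigma_lower / half_width) ** 2) + 1)


def final_size(pilot, half_width, conf=0.95):
    """Stage 2: total sample size from the pilot variance (never below the pilot size)."""
    z = normal_quantile(conf)
    s2 = variance(pilot)  # unbiased sample variance of the pilot observations
    return max(len(pilot), math.floor(z * z * s2 / half_width ** 2) + 1)


def fixed_width_interval(sample, half_width):
    """Interval of total width 2 * half_width centred at the sample mean."""
    centre = mean(sample)
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    m = pilot_size(sigma_lower=1.0, half_width=0.5)
    print("pilot size:", m)
    pilot = [0.0, 2.0, 4.0, 6.0, 8.0]
    print("final size:", final_size(pilot, half_width=1.0))
```

In use, one would draw `pilot_size(...)` observations, compute `final_size(...)` from them, sample the remaining observations, and then report `fixed_width_interval(...)` on the full sample.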

Citations: 0

Contextual batting and bowling in limited overs cricket

IF 0.3 Q4 STATISTICS & PROBABILITY

SOUTH AFRICAN STATISTICAL JOURNAL

Pub Date : 2021-03-31 DOI: 10.37920/SASJ.2021.55.1.6

James Thomson, Harsha Perera, T. Swartz

Cricket is a sport for which many batting and bowling statistics have been proposed. However, a feature of cricket is that the level of aggressiveness adopted by batsmen depends on match circumstances. It is therefore relevant to consider these circumstances when evaluating batting and bowling performances. This paper considers batting performance in the second innings of limited overs cricket when a target has been set. The runs required, the number of overs completed and the wickets taken are relevant in assessing the batting performance. We produce a visualization for second innings batting which describes how a batsman performs under different circumstances. The visualization is then reduced to a single statistic, "clutch batting", which can be used to compare batsmen. An analogous approach is then provided for bowlers based on the symmetry between batting and bowling, and we define the statistic "clutch bowling".

Citations: 1

A generic test for the similarity of spatial data

IF 0.3 Q4 STATISTICS & PROBABILITY

SOUTH AFRICAN STATISTICAL JOURNAL

Pub Date : 2021-03-31 DOI: 10.37920/SASJ.2021.55.1.5

R. Kirsten, I. Fabris-Rotelli

Two spatial data sets are considered to be similar if they originate from the same stochastic process in terms of their spatial structure. Many tests have been developed over recent years to test the similarity of certain types of spatial data, such as spatial point patterns, geostatistical data and images. This research proposes a generic spatial similarity test able to handle various types of spatial data, for example images (modelled spatially), point patterns, marked point patterns, geostatistical data and lattice patterns. A simulation study is conducted to test the method on each spatial data set. The simulation study shows that the proposed spatial similarity test is not sensitive to the user-defined resolution of the pixel image representation, and that it performs well on lattice data, some of the unmarked point patterns and the marked point patterns with discrete marks. We illustrate this test on property prices in the City of Cape Town and the City of Johannesburg, South Africa.

Citations: 0

Regularisation in discrete survival models: A comparison of lasso and gradient boosting

IF 0.3 Q4 STATISTICS & PROBABILITY

SOUTH AFRICAN STATISTICAL JOURNAL

Pub Date : 2021-03-31 DOI: 10.37920/SASJ.2021.55.1.3

A. Bere, Godfrey H. Sithuba, Coster Mabvuu, Retang Mashabela, C. Sigauke, K. Kyei

We present the results of a simulation study performed to compare the accuracy of a lasso-type penalization method and gradient boosting in estimating the baseline hazard function and covariate parameters in discrete survival models. The mean square error results reveal that the lasso-type algorithm performs better in recovering the baseline hazard and covariate parameters. In particular, gradient boosting underestimates the sizes of the parameters and also has a high false positive rate. Similar results are obtained in an application to real-life data.

Citations: 0

Estimating the Gini index for heavy-tailed income distributions

IF 0.3 Q4 STATISTICS & PROBABILITY

SOUTH AFRICAN STATISTICAL JOURNAL

Pub Date : 2021-03-01 DOI: 10.37920/SASJ.2021.55.1.2

Amina Bari, A. Rassoul, Hamid Ould Rouis

In the present paper, we define and study one of the most popular indices for measuring the inequality of capital incomes, known as the Gini index. We construct a semiparametric estimator for the Gini index in the case of heavy-tailed income distributions, establish its asymptotic distribution and derive confidence bounds. We explore the performance of the confidence bounds in a simulation study and draw conclusions about capital incomes in some income distributions.
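For reference, the classical plug-in estimator of the Gini index from a sample of non-negative incomes can be written in a few lines. This is the standard empirical formula, not the semiparametric heavy-tail estimator constructed in the paper, and the function names are illustrative. The closed form 1 / (2α - 1) for a Pareto distribution with shape α > 1 is included as a familiar heavy-tailed benchmark.

```python
def gini(incomes):
    """Empirical Gini index: sum_i (2i - n - 1) * x_(i) / (n * sum_i x_i)."""
    xs = sorted(incomes)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total <= 0:
        raise ValueError("need a non-empty sample with positive total income")
    # i runs over the ranks 1..n of the sorted incomes
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


def pareto_gini(alpha):
    """Population Gini index of a Pareto (type I) distribution with shape alpha > 1."""
    if alpha <= 1:
        raise ValueError("the mean is infinite for alpha <= 1")
    return 1.0 / (2.0 * alpha - 1.0)


if __name__ == "__main__":
    print(gini([1, 1, 1, 1]))   # perfectly equal incomes
    print(gini([0, 0, 0, 1]))   # one person holds all income
    print(pareto_gini(2.0))
```

The plug-in estimator weights every observation by its rank, so a handful of extreme incomes can dominate it in heavy-tailed samples. That sensitivity is the usual motivation for treating the tail separately.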

Citations: 0
