Using the Krylov subspace formulation to improve regularisation and interpretation in partial least squares regression
Pub Date : 2024-09-12  DOI: 10.1007/s00180-024-01545-7
Tommy Löfstedt
Partial least squares regression (PLS-R) has been an important regression method in the life sciences and many other fields for decades. However, PLS-R is typically solved using an opaque algorithmic approach, rather than through an optimisation formulation and procedure. There is a clear optimisation formulation of the PLS-R problem based on a Krylov subspace formulation, but it is only rarely considered. The popularity of PLS-R is attributed to the ability to interpret the data through the model components, but those components are not available when the PLS-R problem is solved using the Krylov subspace formulation. We therefore highlight a simple reformulation of the PLS-R problem using the Krylov subspace formulation as a promising modelling framework for PLS-R, and illustrate one of its main benefits: it allows arbitrary penalties on the regression coefficients of the PLS-R model. Further, we propose an approach to estimate, for a solution found through the Krylov subspace formulation, the PLS-R model components that would have been obtained had the common PLS-R algorithms been applicable. We illustrate the utility of the proposed method on simulated and real data.
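The Krylov subspace formulation referred to above characterises the K-component PLS-R coefficient vector (for a single response) as the least-squares solution restricted to the subspace spanned by X'y, (X'X)X'y, ..., (X'X)^(K-1)X'y, which is what makes it straightforward to add arbitrary penalties on the coefficients. A minimal NumPy sketch of this idea, not the author's implementation (the function names and the optional ridge penalty are illustrative):

```python
import numpy as np

def krylov_basis(X, y, n_components):
    """Orthonormal basis of span{X'y, (X'X)X'y, ..., (X'X)^(K-1) X'y}."""
    v = X.T @ y
    basis = [v]
    for _ in range(n_components - 1):
        basis.append(X.T @ (X @ basis[-1]))
    Q, _ = np.linalg.qr(np.column_stack(basis))  # orthonormalise for stability
    return Q

def pls_via_krylov(X, y, n_components, ridge=0.0):
    """Least squares restricted to the Krylov subspace: beta = Q w.
    The ridge term illustrates how a penalty on beta can be added in this
    formulation; ridge=0 corresponds to plain PLS1 regression."""
    Q = krylov_basis(X, y, n_components)
    Z = X @ Q                                    # reduced design matrix
    w = np.linalg.solve(Z.T @ Z + ridge * np.eye(n_components), Z.T @ y)
    return Q @ w                                 # regression coefficients
```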
{"title":"Using the Krylov subspace formulation to improve regularisation and interpretation in partial least squares regression","authors":"Tommy Löfstedt","doi":"10.1007/s00180-024-01545-7","DOIUrl":"https://doi.org/10.1007/s00180-024-01545-7","url":null,"abstract":"<p>Partial least squares regression (PLS-R) has been an important regression method in the life sciences and many other fields for decades. However, PLS-R is typically solved using an opaque algorithmic approach, rather than through an optimisation formulation and procedure. There is a clear optimisation formulation of the PLS-R problem based on a Krylov subspace formulation, but it is only rarely considered. The popularity of PLS-R is attributed to the ability to interpret the data through the model components, but the model components are not available when solving the PLS-R problem using the Krylov subspace formulation. We therefore highlight a simple reformulation of the PLS-R problem using the Krylov subspace formulation as a promising modelling framework for PLS-R, and illustrate one of the main benefits of this reformulation—that it allows arbitrary penalties of the regression coefficients in the PLS-R model. Further, we propose an approach to estimate the PLS-R model components for the solution found through the Krylov subspace formulation, that are those we would have obtained had we been able to use the common algorithms for estimating the PLS-R model. We illustrate the utility of the proposed method on simulated and real data.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"25 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142186113","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robust matrix factor analysis method with adaptive parameter adjustment using Cauchy weighting
Pub Date : 2024-09-12  DOI: 10.1007/s00180-024-01548-4
Junchen Li
In recent years, high-dimensional matrix factor models have been widely applied in various fields, yet few methods handle heavy-tailed data effectively. To address this problem, we introduce a smooth Cauchy loss function and establish an optimization objective through norm minimization, deriving a Cauchy version of the weighted iterative estimation method. Unlike Huber-loss weighted estimation, the weight calculation in this method is a smooth function rather than a piecewise function. The method also accounts for the need to update the parameters of the Cauchy loss function at each iteration during estimation. Ultimately, we propose a weighted estimation method with adaptive parameter adjustment. We then analyze the theoretical properties of the method, proving that it has a fast convergence rate. In simulations, the method demonstrates clear advantages and can therefore serve as a better alternative to existing estimation methods. Finally, we analyze a dataset of regional population movements between cities, showing that the proposed method yields estimates with better interpretability than competing methods.
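To make the smooth-versus-piecewise distinction concrete, the sketch below contrasts the Cauchy IRLS weight, 1/(1 + (r/c)^2), with the piecewise Huber weight, and applies the Cauchy weights with an adaptively rescaled tuning constant to a toy location problem. This only illustrates the weighting idea under assumed constants; the paper applies it to high-dimensional matrix factor models.

```python
import numpy as np

def cauchy_weights(resid, c):
    """Smooth IRLS weights from the Cauchy loss rho(r) = (c^2/2)*log(1 + (r/c)^2)."""
    return 1.0 / (1.0 + (resid / c) ** 2)

def huber_weights(resid, c):
    """Piecewise Huber weights, shown for contrast: w(r) = min(1, c/|r|)."""
    a = np.abs(resid)
    return np.where(a <= c, 1.0, c / a)

def cauchy_irls_location(x, n_iter=50):
    """Toy iteratively reweighted location estimate; the tuning constant is
    re-derived from the residual scale (MAD) at every iteration, mimicking
    adaptive parameter adjustment."""
    mu = np.median(x)
    for _ in range(n_iter):
        r = x - mu
        scale = np.median(np.abs(r)) / 0.6745    # robust scale estimate
        c = 2.385 * max(scale, 1e-12)            # usual Cauchy tuning constant
        w = cauchy_weights(r, c)
        mu = np.sum(w * x) / np.sum(w)
    return mu
```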
{"title":"Robust matrix factor analysis method with adaptive parameter adjustment using Cauchy weighting","authors":"Junchen Li","doi":"10.1007/s00180-024-01548-4","DOIUrl":"https://doi.org/10.1007/s00180-024-01548-4","url":null,"abstract":"<p>In recent years, high-dimensional matrix factor models have been widely applied in various fields. However, there are few methods that effectively handle heavy-tailed data. To address this problem, we introduced a smooth Cauchy loss function and established an optimization objective through norm minimization, deriving a Cauchy version of the weighted iterative estimation method. Unlike the Huber loss weighted estimation method, the weight calculation in this method is a smooth function rather than a piecewise function. It also considers the need to update parameters in the Cauchy loss function with each iteration during estimation. Ultimately, we propose a weighted estimation method with adaptive parameter adjustment. Subsequently, this paper analyzes the theoretical properties of the method, proving that it has a fast convergence rate. Through data simulation, our method demonstrates significant advantages. Thus, it can serve as a better alternative to other existing estimation methods. Finally, we analyzed a dataset of regional population movements between cities, demonstrating that our proposed method offers estimations with excellent interpretability compared to other methods.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"5 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142186112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A precise and efficient exceedance-set algorithm for detecting environmental extremes
Pub Date : 2024-09-06  DOI: 10.1007/s00180-024-01540-y
Thomas Suesse, Alexander Brenning
Inference for predicted exceedance sets is important for various environmental issues, such as detecting environmental anomalies and emergencies with high confidence. A critical step is to construct inner and outer predicted exceedance sets using an algorithm that samples from the predictive distribution. The simple sampling procedure currently in use can lead to misleading conclusions at some locations because of the relatively large standard errors that arise when proportions are estimated from independent observations. Instead, we propose an algorithm that computes the probabilities numerically using the Genz–Bretz algorithm, which is based on quasi-random numbers and yields more accurate inner and outer sets, as illustrated on rainfall data from the state of Paraná, Brazil.
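As a sketch of the probability calculation involved, the joint probability that a Gaussian predictive vector exceeds a threshold at every location in a candidate set can be evaluated numerically from the multivariate normal CDF. The snippet below uses SciPy's CDF routine (which wraps Genz-type numerical integration) as a stand-in; the paper relies on the Genz–Bretz quasi-Monte Carlo algorithm, available e.g. in R's mvtnorm package.

```python
import numpy as np
from scipy.stats import multivariate_normal

def joint_exceedance_prob(mu, Sigma, threshold):
    """P(Z_i > threshold for every i) for Z ~ N(mu, Sigma).
    Uses the identity P(Z > t componentwise) = P(-Z <= -t), with -Z ~ N(-mu, Sigma),
    so the orthant probability is a single multivariate normal CDF evaluation."""
    mu = np.asarray(mu, dtype=float)
    upper = -threshold * np.ones_like(mu)
    return multivariate_normal.cdf(upper, mean=-mu, cov=Sigma)

# Example: two correlated locations with predictive means near the threshold
Sigma = np.array([[1.0, 0.6], [0.6, 1.0]])
print(joint_exceedance_prob([1.2, 0.9], Sigma, threshold=1.0))
```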
{"title":"A precise and efficient exceedance-set algorithm for detecting environmental extremes","authors":"Thomas Suesse, Alexander Brenning","doi":"10.1007/s00180-024-01540-y","DOIUrl":"https://doi.org/10.1007/s00180-024-01540-y","url":null,"abstract":"<p>Inference for predicted exceedance sets is important for various environmental issues such as detecting environmental anomalies and emergencies with high confidence. A critical part is to construct inner and outer predicted exceedance sets using an algorithm that samples from the predictive distribution. The simple currently used sampling procedure can lead to misleading conclusions for some locations due to relatively large standard errors when proportions are estimated from independent observations. Instead we propose an algorithm that calculates probabilities numerically using the Genz–Bretz algorithm, which is based on quasi-random numbers leading to more accurate inner and outer sets, as illustrated on rainfall data in the state of Paraná, Brazil.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"60 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142224382","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Change point estimation for Gaussian time series data with copula-based Markov chain models
Pub Date : 2024-09-05  DOI: 10.1007/s00180-024-01541-x
Li-Hsien Sun, Yu-Kai Wang, Lien-Hsi Liu, Takeshi Emura, Chi-Yang Chiu
This paper proposes a method for change-point estimation, focusing on detecting structural shifts within time series data. Traditional maximum likelihood estimation (MLE) methods assume either independence or linear dependence via auto-regressive models. To address this limitation, the paper introduces copula-based Markov chain models, offering more flexible dependence modeling. These models treat a Gaussian time series as a Markov chain and utilize copula functions to handle serial dependence. The profile MLE procedure is then employed to estimate the change-point and other model parameters, with the Newton–Raphson algorithm facilitating numerical calculations for the estimators. The proposed approach is evaluated through simulations and real stock return data, considering two distinct periods: the 2008 financial crisis and the COVID-19 pandemic in 2020.
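A simplified sketch of the profile-MLE idea described above: for every candidate change point, the marginal parameters and the copula parameter are estimated by numerical optimisation, and the candidate with the highest profile log-likelihood is selected. The Normal margins with a mean shift, the Clayton copula and the quasi-Newton optimiser (instead of Newton–Raphson) are simplifying assumptions, not the authors' exact model.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def clayton_log_density(u, v, theta):
    """Log density of the Clayton copula (theta > 0)."""
    return (np.log1p(theta)
            - (1.0 + theta) * (np.log(u) + np.log(v))
            - (2.0 + 1.0 / theta) * np.log(u ** (-theta) + v ** (-theta) - 1.0))

def neg_loglik(params, y, tau):
    """Copula-based Markov chain likelihood with a mean shift at time tau:
    sum of marginal Normal log densities plus the serial copula terms."""
    mu1, mu2, log_sigma, log_theta = params
    sigma, theta = np.exp(log_sigma), np.exp(log_theta)
    mu = np.where(np.arange(len(y)) < tau, mu1, mu2)
    logf = norm.logpdf(y, loc=mu, scale=sigma)
    u = np.clip(norm.cdf(y, loc=mu, scale=sigma), 1e-10, 1 - 1e-10)
    return -(logf.sum() + clayton_log_density(u[:-1], u[1:], theta).sum())

def profile_mle_changepoint(y):
    """Grid search over candidate change points, maximising the profile
    log-likelihood in the remaining parameters at each candidate."""
    y = np.asarray(y, dtype=float)
    best_tau, best_val = None, np.inf
    for tau in range(5, len(y) - 5):
        x0 = np.array([y[:tau].mean(), y[tau:].mean(), np.log(y.std()), 0.0])
        res = minimize(neg_loglik, x0, args=(y, tau), method="L-BFGS-B")
        if res.fun < best_val:
            best_tau, best_val = tau, res.fun
    return best_tau
```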
{"title":"Change point estimation for Gaussian time series data with copula-based Markov chain models","authors":"Li-Hsien Sun, Yu-Kai Wang, Lien-Hsi Liu, Takeshi Emura, Chi-Yang Chiu","doi":"10.1007/s00180-024-01541-x","DOIUrl":"https://doi.org/10.1007/s00180-024-01541-x","url":null,"abstract":"<p>This paper proposes a method for change-point estimation, focusing on detecting structural shifts within time series data. Traditional maximum likelihood estimation (MLE) methods assume either independence or linear dependence via auto-regressive models. To address this limitation, the paper introduces copula-based Markov chain models, offering more flexible dependence modeling. These models treat a Gaussian time series as a Markov chain and utilize copula functions to handle serial dependence. The profile MLE procedure is then employed to estimate the change-point and other model parameters, with the Newton–Raphson algorithm facilitating numerical calculations for the estimators. The proposed approach is evaluated through simulations and real stock return data, considering two distinct periods: the 2008 financial crisis and the COVID-19 pandemic in 2020.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"46 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142186114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
INet for network integration
Pub Date : 2024-09-04  DOI: 10.1007/s00180-024-01536-8
Valeria Policastro, Matteo Magnani, Claudia Angelini, Annamaria Carissimo
When several data sets and heterogeneous data types are collected on a given phenomenon of interest, analyzing each data set individually provides only a partial view of that phenomenon. Integrating all the data instead may widen and deepen the results, offering a better view of the entire system. In the context of network integration, we propose the INet algorithm. INet assumes that the different network layers of the same system share a similar structure representing latent variables. Therefore, by combining individual edge weights and topological network structures, INet first constructs a Consensus Network that represents the information shared across the different layers, providing a global view of the entities that play a fundamental role in the phenomenon of interest. It then derives a Case Specific Network for each layer, containing the information specific to that single data type and not present in all the others. We demonstrate good performance of our method on simulated data and detect new insights by analyzing biological and sociological datasets.
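The sketch below illustrates the general idea of a consensus network (edges supported across layers) and case-specific networks (edges strong in one layer only). The thresholding-and-voting rule is an assumption made for illustration; it is not the INet algorithm itself, which combines edge weights and topological structure in a more principled way.

```python
import numpy as np

def consensus_network(layers, weight_thr=0.5, vote_thr=0.5):
    """Keep an edge if its normalised weight exceeds weight_thr in at least a
    fraction vote_thr of the layers; weight it by the cross-layer mean."""
    L = np.stack([A / np.max(np.abs(A)) for A in layers])
    votes = (L > weight_thr).mean(axis=0)
    return np.where(votes >= vote_thr, L.mean(axis=0), 0.0)

def case_specific_network(layer, consensus, weight_thr=0.5):
    """Edges that are strong in a single layer but absent from the consensus."""
    A = layer / np.max(np.abs(layer))
    return np.where((A > weight_thr) & (consensus == 0), A, 0.0)
```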
{"title":"INet for network integration","authors":"Valeria Policastro, Matteo Magnani, Claudia Angelini, Annamaria Carissimo","doi":"10.1007/s00180-024-01536-8","DOIUrl":"https://doi.org/10.1007/s00180-024-01536-8","url":null,"abstract":"<p>When collecting several data sets and heterogeneous data types on a given phenomenon of interest, the individual analysis of each data set will provide only a particular view of such phenomenon. Instead, integrating all the data may widen and deepen the results, offering a better view of the entire system. In the context of network integration, we propose the <span>INet</span> algorithm. <span>INet</span> assumes a similar network structure, representing latent variables in different network layers of the same system. Therefore, by combining individual edge weights and topological network structures, <span>INet</span> first constructs a <span>Consensus Network</span> that represents the shared information underneath the different layers to provide a global view of the entities that play a fundamental role in the phenomenon of interest. Then, it derives a <span>Case Specific Network</span> for each layer containing peculiar information of the single data type not present in all the others. We demonstrated good performance with our method through simulated data and detected new insights by analyzing biological and sociological datasets.\u0000</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"13 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142186115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Generalized linear model based on latent factors and supervised components
Pub Date : 2024-09-02  DOI: 10.1007/s00180-024-01544-8
Julien Gibaud, Xavier Bry, Catherine Trottier
In a context of component-based multivariate modeling, we propose to model the residual dependence of the responses. Each response of a response vector is assumed to depend, through a Generalized Linear Model, on a set of explanatory variables. The vast majority of the explanatory variables are partitioned into conceptually homogeneous variable groups, viewed as explanatory themes. The variables within a theme are typically numerous, and some of them are highly correlated or even collinear; generalized linear regression therefore demands dimension reduction and regularization with respect to each theme. Besides these, we consider a small set of “additional” covariates that are not conceptually linked to the themes and demand no regularization. Supervised Component Generalized Linear Regression was proposed to both regularize and reduce the dimension of the explanatory space by searching each theme for an appropriate number of orthogonal components, which both contribute to predicting the responses and capture relevant structural information in the themes. In this paper, we introduce random latent variables (a.k.a. factors) to model the covariance matrix of the linear predictors of the responses conditional on the components. To estimate the model, we present an algorithm combining supervised component-based model estimation with factor model estimation. The methodology is tested on simulated data and then applied to an agricultural ecology dataset.
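A rough sketch of the component-plus-GLM workflow described above: each theme is reduced to a few supervised components, which are then combined with the unpenalised additional covariates in a GLM. Here PLS is used only as a stand-in for the supervised-component criterion and a Poisson response is assumed; the paper's latent factors for the residual dependence between responses are not represented.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import PoissonRegressor

def theme_components(themes, y, n_comp=2):
    """Extract n_comp supervised components from each explanatory theme
    (PLS here plays the role of the supervised-component criterion)."""
    blocks = []
    for X_theme in themes:
        pls = PLSRegression(n_components=n_comp).fit(X_theme, y)
        blocks.append(pls.transform(X_theme))
    return np.hstack(blocks)

def fit_component_glm(themes, X_extra, y, n_comp=2):
    """GLM (Poisson response assumed) on the theme components together with
    the unpenalised additional covariates."""
    Z = np.hstack([theme_components(themes, y, n_comp), X_extra])
    return PoissonRegressor(alpha=0.0).fit(Z, y)
```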
{"title":"Generalized linear model based on latent factors and supervised components","authors":"Julien Gibaud, Xavier Bry, Catherine Trottier","doi":"10.1007/s00180-024-01544-8","DOIUrl":"https://doi.org/10.1007/s00180-024-01544-8","url":null,"abstract":"<p>In a context of component-based multivariate modeling we propose to model the residual dependence of the responses. Each response of a response vector is assumed to depend, through a Generalized Linear Model, on a set of explanatory variables. The vast majority of explanatory variables are partitioned into conceptually homogeneous variable groups, viewed as explanatory themes. Variables in themes are supposed many and some of them are highly correlated or even collinear. Thus, generalized linear regression demands dimension reduction and regularization with respect to each theme. Besides them, we consider a small set of “additional” covariates not conceptually linked to the themes, and demanding no regularization. Supervised Component Generalized Linear Regression proposed to both regularize and reduce the dimension of the explanatory space by searching each theme for an appropriate number of orthogonal components, which both contribute to predict the responses and capture relevant structural information in themes. In this paper, we introduce random latent variables (a.k.a. factors) so as to model the covariance matrix of the linear predictors of the responses conditional on the components. To estimate the model, we present an algorithm combining supervised component-based model estimation with factor model estimation. This methodology is tested on simulated data and then applied to an agricultural ecology dataset.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"33 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142186116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SpICE: an interpretable method for spatial data
Pub Date : 2024-08-26  DOI: 10.1007/s00180-024-01538-6
Natalia da Silva, Ignacio Alvarez-Castro, Leonardo Moreno, Andrés Sosa
Statistical learning methods are widely utilised in tackling complex problems due to their flexibility, good predictive performance and ability to capture complex relationships among variables. Additionally, recently developed automatic workflows have provided a standardised approach for implementing statistical learning methods across various applications. However, these tools highlight one of the main drawbacks of statistical learning: the lack of interpretability of the results. In the past few years, a large amount of research has focused on methods for interpreting black-box models, since interpretable statistical learning methods are necessary for obtaining a deeper understanding of these models. Specifically, in problems in which spatial information is relevant, combining interpretable methods with spatial data can provide a better understanding of the problem and an improved interpretation of the results. This paper focuses on the individual conditional expectation plot (ICE-plot), a model-agnostic method for interpreting statistical learning models, and on combining it with spatial information. An ICE-plot extension is proposed in which spatial information is used as a restriction to define spatial ICE (SpICE) curves. Spatial ICE curves are estimated using real data in the context of an economic problem concerning property valuation in Montevideo, Uruguay. Understanding the key factors that influence property valuation is essential for decision-making, and spatial data play a relevant role in this regard.
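For reference, an ICE curve is obtained by varying one feature over a grid for each individual observation while keeping its other features fixed, and recording the model's prediction. The sketch below computes plain ICE curves and then aggregates them within groups of spatially close observations; using pre-computed spatial clusters as the grouping is an assumption made for illustration, not the SpICE restriction defined in the paper.

```python
import numpy as np

def ice_curves(model, X, feature, grid):
    """One curve per observation: vary X[:, feature] over grid and predict."""
    curves = np.empty((X.shape[0], len(grid)))
    for j, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature] = value
        curves[:, j] = model.predict(X_mod)
    return curves

def grouped_spatial_ice(model, X, feature, grid, spatial_labels):
    """Average the ICE curves within groups of spatially close observations
    (e.g. clusters of coordinates), giving one curve per group."""
    curves = ice_curves(model, X, feature, grid)
    labels = np.asarray(spatial_labels)
    return {g: curves[labels == g].mean(axis=0) for g in np.unique(labels)}
```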
{"title":"SpICE: an interpretable method for spatial data","authors":"Natalia da Silva, Ignacio Alvarez-Castro, Leonardo Moreno, Andrés Sosa","doi":"10.1007/s00180-024-01538-6","DOIUrl":"https://doi.org/10.1007/s00180-024-01538-6","url":null,"abstract":"<p>Statistical learning methods are widely utilised in tackling complex problems due to their flexibility, good predictive performance and ability to capture complex relationships among variables. Additionally, recently developed automatic workflows have provided a standardised approach for implementing statistical learning methods across various applications. However, these tools highlight one of the main drawbacks of statistical learning: the lack of interpretability of the results. In the past few years, a large amount of research has been focused on methods for interpreting black box models. Having interpretable statistical learning methods is necessary for obtaining a deeper understanding of these models. Specifically in problems in which spatial information is relevant, combining interpretable methods with spatial data can help to provide a better understanding of the problem and an improved interpretation of the results. This paper is focused on the individual conditional expectation plot (ICE-plot), a model-agnostic method for interpreting statistical learning models and combining them with spatial information. An ICE-plot extension is proposed in which spatial information is used as a restriction to define spatial ICE (SpICE) curves. Spatial ICE curves are estimated using real data in the context of an economic problem concerning property valuation in Montevideo, Uruguay. Understanding the key factors that influence property valuation is essential for decision-making, and spatial data play a relevant role in this regard.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"58 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142186117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance of evaluation metrics for classification in imbalanced data
Pub Date : 2024-08-24  DOI: 10.1007/s00180-024-01539-5
Alex de la Cruz Huayanay, Jorge L. Bazán, Cibele M. Russo
This paper investigates the effectiveness of various metrics for selecting an adequate model for binary classification when the data are imbalanced. Through an extensive simulation study involving 12 commonly used classification metrics, our findings indicate that the Matthews correlation coefficient, the G-mean, and Cohen's kappa consistently yield favorable performance. Conversely, the area under the curve and accuracy demonstrate poor performance across all studied scenarios, while the other seven metrics exhibit varying degrees of effectiveness in specific scenarios. Furthermore, we discuss a practical application in the financial area, which confirms the robust performance of these metrics in facilitating model selection among alternative link functions.
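The three metrics singled out above are readily available (or easily derived) from scikit-learn, which makes the contrast with accuracy easy to reproduce; the toy vectors below are illustrative, not the paper's simulation design.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef, cohen_kappa_score, recall_score, accuracy_score

def gmean_score(y_true, y_pred):
    """Geometric mean of sensitivity and specificity."""
    sensitivity = recall_score(y_true, y_pred, pos_label=1)
    specificity = recall_score(y_true, y_pred, pos_label=0)
    return np.sqrt(sensitivity * specificity)

# Heavily imbalanced toy data: 95 negatives, 5 positives, 3 positives missed
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.array([0] * 95 + [0, 0, 0, 1, 1])

print(accuracy_score(y_true, y_pred))      # 0.97 despite missing most positives
print(matthews_corrcoef(y_true, y_pred))   # penalises the missed minority class
print(cohen_kappa_score(y_true, y_pred))
print(gmean_score(y_true, y_pred))
```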
{"title":"Performance of evaluation metrics for classification in imbalanced data","authors":"Alex de la Cruz Huayanay, Jorge L. Bazán, Cibele M. Russo","doi":"10.1007/s00180-024-01539-5","DOIUrl":"https://doi.org/10.1007/s00180-024-01539-5","url":null,"abstract":"<p>This paper investigates the effectiveness of various metrics for selecting the adequate model for binary classification when data is imbalanced. Through an extensive simulation study involving 12 commonly used metrics of classification, our findings indicate that the Matthews Correlation Coefficient, G-Mean, and Cohen’s kappa consistently yield favorable performance. Conversely, the area under the curve and Accuracy metrics demonstrate poor performance across all studied scenarios, while other seven metrics exhibit varying degrees of effectiveness in specific scenarios. Furthermore, we discuss a practical application in the financial area, which confirms the robust performance of these metrics in facilitating model selection among alternative link functions.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"23 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142186118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A theory of contrasts for modified Freeman–Tukey statistics and its applications to Tukey’s post-hoc tests for contingency tables
Pub Date : 2024-08-17  DOI: 10.1007/s00180-024-01537-7
Yoshio Takane, Eric J. Beh, Rosaria Lombardo
This paper presents a theory of contrasts designed for modified Freeman–Tukey (FT) statistics, which are derived through square-root transformations of the observed frequencies (proportions) in contingency tables. Some modifications of the original FT statistic are necessary to allow ANOVA-like exact decompositions of the global goodness-of-fit (GOF) measures. The square-root transformations have the important effect of stabilizing (equalizing) variances. The theory is then used to derive Tukey’s post-hoc pairwise comparison tests for contingency tables. Tukey’s tests are more restrictive, but more powerful, than the Scheffé post-hoc tests developed earlier for the analysis of contingency tables. Throughout the paper, numerical examples are given to illustrate the theory. Modified FT statistics, like other similar statistics for contingency tables, are based on a large-sample rationale; small Monte Carlo studies are conducted to investigate the asymptotic (and non-asymptotic) behavior of the proposed statistics.
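For orientation, the classical Freeman–Tukey statistic for a two-way table is shown below; it is built on the same square-root transformation of the observed counts. The modified FT statistics studied in the paper differ from this textbook version precisely so that exact ANOVA-like decompositions become possible.

```python
import numpy as np
from scipy.stats import chi2

def freeman_tukey_test(table):
    """Classical Freeman-Tukey statistic for independence in a two-way table,
    T2 = sum (sqrt(O) + sqrt(O + 1) - sqrt(4E + 1))^2,
    referred to a chi-square distribution with (r-1)(c-1) degrees of freedom."""
    O = np.asarray(table, dtype=float)
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()   # independence fit
    t2 = np.sum((np.sqrt(O) + np.sqrt(O + 1) - np.sqrt(4 * E + 1)) ** 2)
    df = (O.shape[0] - 1) * (O.shape[1] - 1)
    return t2, chi2.sf(t2, df)
```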
{"title":"A theory of contrasts for modified Freeman–Tukey statistics and its applications to Tukey’s post-hoc tests for contingency tables","authors":"Yoshio Takane, Eric J. Beh, Rosaria Lombardo","doi":"10.1007/s00180-024-01537-7","DOIUrl":"https://doi.org/10.1007/s00180-024-01537-7","url":null,"abstract":"<p>This paper presents a theory of contrasts designed for modified Freeman–Tukey (FT) statistics which are derived through square-root transformations of observed frequencies (proportions) in contingency tables. Some modifications of the original FT statistic are necessary to allow for ANOVA-like exact decompositions of the global goodness of fit (GOF) measures. The square-root transformations have an important effect of stabilizing (equalizing) variances. The theory is then used to derive Tukey’s post-hoc pairwise comparison tests for contingency tables. Tukey’s tests are more restrictive, but are more powerful, than Scheffè’s post-hoc tests developed earlier for the analysis of contingency tables. Throughout this paper, numerical examples are given to illustrate the theory. Modified FT statistics, like other similar statistics for contingency tables, are based on a large-sample rationale. Small Monte-Carlo studies are conducted to investigate asymptotic (and non-asymptotic) behaviors of the proposed statistics.\u0000</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"32 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142186119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A novel nonconvex, smooth-at-origin penalty for statistical learning
Pub Date : 2024-08-07  DOI: 10.1007/s00180-024-01525-x
Majnu John, Sujit Vettam, Yihren Wu
Nonconvex penalties are utilized for regularization in high-dimensional statistical learning algorithms primarily because they yield unbiased or nearly unbiased estimators for the parameters in the model. Nonconvex penalties existing in the literature, such as SCAD, MCP, Laplace and arctan, have a singularity at the origin, which also makes them useful for variable selection. However, in several high-dimensional frameworks such as deep learning, variable selection is less of a concern. In this paper, we present a nonconvex penalty that is smooth at the origin. The paper includes asymptotic results for ordinary least squares estimators regularized with the new penalty function, showing an asymptotic bias that vanishes exponentially fast. We also conducted simulations to better understand the finite-sample properties, and carried out an empirical study employing a deep neural network architecture on three datasets and a convolutional neural network on four datasets. The empirical study based on artificial neural networks showed better performance for the new regularization approach in five of the seven datasets.
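To illustrate the distinction drawn above, the snippet contrasts the arctan penalty, whose |θ| term creates a singularity at the origin, with a generic bounded nonconvex penalty that is smooth there (quadratic near zero). The smooth example is a placeholder chosen for illustration; it is not the penalty proposed in the paper.

```python
import numpy as np

def arctan_penalty(theta, lam, gamma=1.0):
    """Arctan penalty: nonconvex and singular at the origin because of |theta|."""
    return lam * (2.0 / np.pi) * np.arctan(np.abs(theta) / gamma)

def smooth_nonconvex_penalty(theta, lam, gamma=1.0):
    """A bounded nonconvex penalty that is smooth at the origin:
    it behaves like (lam/gamma) * theta^2 near zero, so its derivative at 0 is 0."""
    return lam * theta ** 2 / (gamma + theta ** 2)
```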
{"title":"A novel nonconvex, smooth-at-origin penalty for statistical learning","authors":"Majnu John, Sujit Vettam, Yihren Wu","doi":"10.1007/s00180-024-01525-x","DOIUrl":"https://doi.org/10.1007/s00180-024-01525-x","url":null,"abstract":"<p>Nonconvex penalties are utilized for regularization in high-dimensional statistical learning algorithms primarily because they yield unbiased or nearly unbiased estimators for the parameters in the model. Nonconvex penalties existing in the literature such as SCAD, MCP, Laplace and arctan have a singularity at origin which makes them useful also for variable selection. However, in several high-dimensional frameworks such as deep learning, variable selection is less of a concern. In this paper, we present a nonconvex penalty which is smooth at origin. The paper includes asymptotic results for ordinary least squares estimators regularized with the new penalty function, showing asymptotic bias that vanishes exponentially fast. We also conducted simulations to better understand the finite sample properties and conducted an empirical study employing deep neural network architecture on three datasets and convolutional neural network on four datasets. The empirical study based on artificial neural networks showed better performance for the new regularization approach in five out of the seven datasets.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"4 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141969706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}