Two-stage regression spline modeling based on local polynomial kernel regression
Pub Date: 2024-05-01 | DOI: 10.1007/s00180-024-01498-x
Hamid Mraoui, Ahmed El-Alaoui, Souad Bechrouri, Nezha Mohaoui, Abdelilah Monir
This paper introduces a new nonparametric regression estimator based on a local quasi-interpolation spline method. The model combines a B-spline basis with simple local polynomial regression, via a blossoming approach, to produce a reduced-rank spline-like smoother. Different coefficient functionals are allowed to have different smoothing parameters (bandwidths) when the smoothness of the function varies, and the number and location of the knots of the estimator are not fixed. In practice, one may employ a modest number of basis functions and then choose the smoothing parameter as the minimizer of a selection criterion. In simulations, the approach is very competitive with P-spline and smoothing-spline methods. Simulated data and a real data example illustrate the effectiveness of the proposed method.
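The two-stage construction can be illustrated with a minimal sketch, assuming a generic local-linear kernel smoother evaluated at a modest grid of knots followed by a cubic interpolating B-spline; the blossoming-based quasi-interpolant of the paper is not reproduced, and all names, bandwidths, and knot counts below are illustrative assumptions.

```python
# Two-stage sketch: (1) local-linear kernel regression pre-smooths the data at a
# modest grid of knots; (2) a cubic B-spline is passed through the pre-smoothed
# values. This stands in for, but is not, the paper's quasi-interpolation scheme.
import numpy as np
from scipy.interpolate import make_interp_spline

def local_linear(x, y, x0, h):
    """Local-linear estimate of E[y | x = x0] with a Gaussian kernel of bandwidth h."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    D = np.column_stack([np.ones_like(x), x - x0])
    coef = np.linalg.solve(D.T @ (w[:, None] * D), D.T @ (w * y))
    return coef[0]                                   # fitted value at x0

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 200)

knots = np.linspace(0.02, 0.98, 15)                  # modest number of basis locations
stage1 = np.array([local_linear(x, y, t, h=0.05) for t in knots])

spline = make_interp_spline(knots, stage1, k=3)      # stage 2: spline through smoothed values
y_hat = spline(np.linspace(0.02, 0.98, 400))
```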
{"title":"Two-stage regression spline modeling based on local polynomial kernel regression","authors":"Hamid Mraoui, Ahmed El-Alaoui, Souad Bechrouri, Nezha Mohaoui, Abdelilah Monir","doi":"10.1007/s00180-024-01498-x","DOIUrl":"https://doi.org/10.1007/s00180-024-01498-x","url":null,"abstract":"<p>This paper introduces a new nonparametric estimator of the regression based on local quasi-interpolation spline method. This model combines a B-spline basis with a simple local polynomial regression, via blossoming approach, to produce a reduced rank spline like smoother. Different coefficients functionals are allowed to have different smoothing parameters (bandwidths) if the function has different smoothness. In addition, the number and location of the knots of this estimator are not fixed. In practice, we may employ a modest number of basis functions and then determine the smoothing parameter as the minimizer of the criterion. In simulations, the approach achieves very competitive performance with P-spline and smoothing spline methods. Simulated data and a real data example are used to illustrate the effectiveness of the method proposed in this paper.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"17 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140827212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Advancements in reliability estimation for the exponentiated Pareto distribution: a comparison of classical and Bayesian methods with lower record values
Pub Date: 2024-04-29 | DOI: 10.1007/s00180-024-01497-y
Shubham Saini
Estimating the reliability of multicomponent systems is crucial in many engineering and reliability-analysis applications. This paper investigates multicomponent stress-strength reliability estimation using lower record values, specifically for the exponentiated Pareto distribution. We compare classical estimation techniques, such as maximum likelihood estimation, with Bayesian estimation methods. Under the Bayesian approach, we employ Markov chain Monte Carlo techniques and Tierney-Kadane's approximation to obtain the posterior distribution of the reliability parameter. To evaluate the performance of the proposed estimation approaches, we conduct a comprehensive simulation study covering various system configurations and sample sizes. We also analyze real data to illustrate the practical applicability of the methods. The proposed methodology provides valuable guidance for engineers and reliability analysts in accurately assessing the reliability of multicomponent systems from lower record values.
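For orientation, the target quantity is the multicomponent stress-strength reliability R_{s,k}, the probability that at least s of k component strengths exceed a common stress. A minimal Monte Carlo sketch under exponentiated Pareto margins follows; it only illustrates the reliability being estimated, not the maximum likelihood, MCMC, or Tierney-Kadane estimators of the paper, and the parameter values are arbitrary assumptions.

```python
# Monte Carlo approximation of R_{s,k} under exponentiated Pareto margins with
# CDF F(x) = (1 - (1 + x)^(-lam))^alpha for x > 0; parameters are arbitrary.
import numpy as np

def r_exp_pareto(alpha, lam, size, rng):
    """Inverse-CDF sampling from the exponentiated Pareto distribution."""
    u = rng.random(size)
    return (1.0 - u ** (1.0 / alpha)) ** (-1.0 / lam) - 1.0

def reliability_s_of_k(s, k, alpha_x, lam_x, alpha_y, lam_y, n_sim=200_000, seed=0):
    """P(at least s of k strengths X exceed the stress Y)."""
    rng = np.random.default_rng(seed)
    strengths = r_exp_pareto(alpha_x, lam_x, (n_sim, k), rng)
    stress = r_exp_pareto(alpha_y, lam_y, (n_sim, 1), rng)
    return np.mean((strengths > stress).sum(axis=1) >= s)

print(reliability_s_of_k(s=2, k=4, alpha_x=2.0, lam_x=1.5, alpha_y=1.0, lam_y=1.5))
```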
{"title":"Advancements in reliability estimation for the exponentiated Pareto distribution: a comparison of classical and Bayesian methods with lower record values","authors":"Shubham Saini","doi":"10.1007/s00180-024-01497-y","DOIUrl":"https://doi.org/10.1007/s00180-024-01497-y","url":null,"abstract":"<p>Estimating the reliability of multicomponent systems is crucial in various engineering and reliability analysis applications. This paper investigates the multicomponent stress strength reliability estimation using lower record values, specifically for the exponentiated Pareto distribution. We compare classical estimation techniques, such as maximum likelihood estimation, with Bayesian estimation methods. Under Bayesian estimation, we employ Markov Chain Monte Carlo techniques and Tierney–Kadane’s approximation to obtain the posterior distribution of the reliability parameter. To evaluate the performance of the proposed estimation approaches, we conduct a comprehensive simulation study, considering various system configurations and sample sizes. Additionally, we analyze real data to illustrate the practical applicability of our methods. The proposed methodologies provide valuable insights for engineers and reliability analysts in accurately assessing the reliability of multicomponent systems using lower record values.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"153 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140885012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Maximizing adjusted covariance: new supervised dimension reduction for classification
Pub Date: 2024-04-02 | DOI: 10.1007/s00180-024-01472-7
Hyejoon Park, Hyunjoong Kim, Yung-Seop Lee
This study proposes a new linear dimension reduction technique called Maximizing Adjusted Covariance (MAC), which is suitable for supervised classification. The approach adjusts the covariance matrix between the input and target variables using the within-class sum of squares, thereby promoting class separation after linear dimension reduction. MAC has a low computational cost and can complement existing linear dimension reduction techniques for classification. The classification performance of MAC was compared with that of existing linear dimension reduction methods on 44 datasets. For most of the classification models used in the experiments, the MAC dimension reduction method achieved better classification accuracy and F1 scores than the other linear dimension reduction methods.
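The precise MAC adjustment is defined in the paper; the sketch below is only an assumed stand-in that follows the description above at a generic level: a cross-covariance between centred inputs and centred class indicators is adjusted by the within-class sum of squares, and the leading singular vectors give the projection directions.

```python
# Illustrative stand-in for covariance-based supervised reduction (NOT the exact
# MAC formula): whiten the input-target cross-covariance by the within-class
# sum of squares and project onto its leading singular vectors.
import numpy as np

def supervised_directions(X, y, n_components=2, ridge=1e-6):
    X = X - X.mean(axis=0)
    classes = np.unique(y)
    Y = (y[:, None] == classes[None, :]).astype(float)   # one-hot class indicators
    Y = Y - Y.mean(axis=0)
    # Within-class sum of squares (scatter) matrix.
    W = sum(np.cov(X[y == c].T, bias=True) * np.sum(y == c) for c in classes)
    C = X.T @ Y / len(y)                                  # input-target cross-covariance
    adjusted = np.linalg.solve(W + ridge * np.eye(X.shape[1]), C)
    U, _, _ = np.linalg.svd(adjusted, full_matrices=False)
    return U[:, :n_components]

# Usage: Z = X @ supervised_directions(X, y, n_components=2)
```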
{"title":"Maximizing adjusted covariance: new supervised dimension reduction for classification","authors":"Hyejoon Park, Hyunjoong Kim, Yung-Seop Lee","doi":"10.1007/s00180-024-01472-7","DOIUrl":"https://doi.org/10.1007/s00180-024-01472-7","url":null,"abstract":"<p>This study proposes a new linear dimension reduction technique called Maximizing Adjusted Covariance (MAC), which is suitable for supervised classification. The new approach is to adjust the covariance matrix between input and target variables using the within-class sum of squares, thereby promoting class separation after linear dimension reduction. MAC has a low computational cost and can complement existing linear dimensionality reduction techniques for classification. In this study, the classification performance by MAC was compared with those of the existing linear dimension reduction methods using 44 datasets. In most of the classification models used in the experiment, the MAC dimension reduction method showed better classification accuracy and F1 score than other linear dimension reduction methods.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"53 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140567927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A class of transformed joint quantile time series models with applications to health studies
Fahimeh Tourani-Farani, Zeynab Aghabazaz, Iraj Kazemi
Pub Date: 2024-04-01 | DOI: 10.1007/s00180-024-01484-3
Extensions of quantile regression modeling for time series analysis are widely employed in medical and health studies. This study introduces a class of transformed quantile-dispersion regression models for non-stationary time series. These models are flexible enough to incorporate a time-varying structure into the model specification, enabling precise predictions for future decisions. The proposed modeling methodology applies to dynamic processes characterized by high variation and possible periodicity, and relies on a non-linear framework. In addition, unlike transformed time series models, our approach interprets the regression parameters directly in terms of the original response. For computation, we present an iteratively reweighted least squares algorithm. We assess the performance of the model through simulation experiments and illustrate the modeling strategy by analyzing time series of influenza infections and daily COVID-19 deaths.
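The iteratively reweighted least squares step can be sketched for a plain linear quantile regression (the paper's transformed joint quantile-dispersion model is more elaborate): the check loss ρ_τ(r) = r(τ - 1{r<0}) is rewritten as a weighted squared error and the weights are refreshed at each iteration. All settings below are illustrative assumptions.

```python
# Generic IRLS for linear quantile regression at level tau: the check loss is
# expressed as w(r) * r^2 with w(r) = (tau if r >= 0 else 1 - tau) / |r|, and a
# weighted least-squares problem is solved at each iteration.
import numpy as np

def irls_quantile(X, y, tau=0.5, n_iter=50, eps=1e-6):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]          # OLS start
    for _ in range(n_iter):
        r = y - X @ beta
        w = np.where(r >= 0, tau, 1.0 - tau) / np.maximum(np.abs(r), eps)
        Xw = X * np.sqrt(w)[:, None]                     # row-scale the system
        beta = np.linalg.lstsq(Xw, np.sqrt(w) * y, rcond=None)[0]
    return beta

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(300), rng.normal(size=300)])
y = 1.0 + 2.0 * X[:, 1] + rng.standard_t(3, size=300)
print(irls_quantile(X, y, tau=0.9))
```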
{"title":"A class of transformed joint quantile time series models with applications to health studies","authors":"Fahimeh Tourani-Farani, Zeynab Aghabazaz, Iraj Kazemi","doi":"10.1007/s00180-024-01484-3","DOIUrl":"https://doi.org/10.1007/s00180-024-01484-3","url":null,"abstract":"<p>Extensions of quantile regression modeling for time series analysis are extensively employed in medical and health studies. This study introduces a specific class of transformed quantile-dispersion regression models for non-stationary time series. These models possess the flexibility to incorporate the time-varying structure into the model specification, enabling precise predictions for future decisions. Our proposed modeling methodology applies to dynamic processes characterized by high variation and possible periodicity, relying on a non-linear framework. Additionally, unlike the transformed time series model, our approach directly interprets the regression parameters concerning the initial response. For computational purposes, we present an iteratively reweighted least squares algorithm. To assess the performance of our model, we conduct simulation experiments. To illustrate the modeling strategy, we analyze time-series measurements of influenza infection and daily COVID-19 deaths.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"96 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140567967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A smoothed semiparametric likelihood for estimation of nonparametric finite mixture models with a copula-based dependence structure
Pub Date: 2024-03-27 | DOI: 10.1007/s00180-024-01483-4
Michael Levine, Gildas Mazo
In this manuscript, we consider a finite multivariate nonparametric mixture model where the dependence between the marginal densities is modeled using the copula device. Pseudo expectation–maximization (EM) stochastic algorithms were recently proposed to estimate all of the components of this model under a location-scale constraint on the marginals. Here, we introduce a deterministic algorithm that seeks to maximize a smoothed semiparametric likelihood. No location-scale assumption is made about the marginals. The algorithm is monotonic in one special case, and, in another, leads to “approximate monotonicity”—whereby the difference between successive values of the objective function becomes non-negative up to an additive term that becomes negligible after a sufficiently large number of iterations. The behavior of this algorithm is illustrated on several simulated and real datasets. The results suggest that, under suitable conditions, the proposed algorithm may indeed be monotonic in general. A discussion of the results and some possible future research directions round out our presentation.
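The "copula device" for joining nonparametric marginals can be illustrated in its simplest form, outside the mixture and smoothed-likelihood machinery of the paper: kernel density estimates supply the marginal densities, pseudo-observations supply the copula arguments, and a Gaussian copula supplies the dependence. The copula family, bandwidths, and construction below are illustrative assumptions, not the authors' estimator.

```python
# Building block only: a bivariate density of the form
#   f(x1, x2) = c(F1(x1), F2(x2)) * f1(x1) * f2(x2),
# with KDE marginals and a Gaussian copula fitted to normal scores.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
z = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=500)
x1, x2 = np.exp(z[:, 0]), z[:, 1] ** 2 + z[:, 1]          # toy dependent data

kde1, kde2 = stats.gaussian_kde(x1), stats.gaussian_kde(x2)
u1 = stats.rankdata(x1) / (len(x1) + 1)                    # pseudo-observations
u2 = stats.rankdata(x2) / (len(x2) + 1)
rho = np.corrcoef(stats.norm.ppf(u1), stats.norm.ppf(u2))[0, 1]

def joint_density(a, b):
    ua = np.interp(a, np.sort(x1), np.sort(u1))            # crude marginal CDFs
    ub = np.interp(b, np.sort(x2), np.sort(u2))
    za, zb = stats.norm.ppf(ua), stats.norm.ppf(ub)
    cop = stats.multivariate_normal([0, 0], [[1, rho], [rho, 1]]).pdf([za, zb]) / (
        stats.norm.pdf(za) * stats.norm.pdf(zb))           # Gaussian copula density
    return cop * kde1(a)[0] * kde2(b)[0]

print(joint_density(np.median(x1), np.median(x2)))
```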
{"title":"A smoothed semiparametric likelihood for estimation of nonparametric finite mixture models with a copula-based dependence structure","authors":"Michael Levine, Gildas Mazo","doi":"10.1007/s00180-024-01483-4","DOIUrl":"https://doi.org/10.1007/s00180-024-01483-4","url":null,"abstract":"<p>In this manuscript, we consider a finite multivariate nonparametric mixture model where the dependence between the marginal densities is modeled using the copula device. Pseudo expectation–maximization (EM) stochastic algorithms were recently proposed to estimate all of the components of this model under a location-scale constraint on the marginals. Here, we introduce a deterministic algorithm that seeks to maximize a smoothed semiparametric likelihood. No location-scale assumption is made about the marginals. The algorithm is monotonic in one special case, and, in another, leads to “approximate monotonicity”—whereby the difference between successive values of the objective function becomes non-negative up to an additive term that becomes negligible after a sufficiently large number of iterations. The behavior of this algorithm is illustrated on several simulated and real datasets. The results suggest that, under suitable conditions, the proposed algorithm may indeed be monotonic in general. A discussion of the results and some possible future research directions round out our presentation.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"28 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140884878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A subspace aggregating algorithm for accurate classification
Pub Date: 2024-03-09 | DOI: 10.1007/s00180-024-01476-3
Saeid Amiri, Reza Modarres
We present a technique for learning via aggregation in supervised classification. The new method improves classification performance regardless of which classifier is at its core. The approach exploits information hidden in subspaces by aggregating over combinations of variables and is applicable to high-dimensional data sets. We provide algorithms that randomly divide the variables into smaller subsets and permute them before applying a classification method to each subset, and we combine the resulting predictions to determine class membership. Theoretical and simulation analyses consistently demonstrate the high accuracy of the classification methods. Compared with aggregating observations through sampling, the approach proves to be significantly more effective. Through extensive simulations, we evaluate the accuracy of various classification methods, and we further illustrate the effectiveness of the techniques on five real-world data sets.
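A minimal sketch of the variable-aggregation idea follows, assuming nothing beyond the description above: the variables are permuted and split into random disjoint subsets, a base classifier is fitted on each subset, and the subset predictions are combined by majority vote. The base learner and the number of rounds are illustrative choices, not the paper's settings.

```python
# Subspace aggregation sketch: permute variables, split into disjoint subsets,
# fit a base classifier per subset, majority-vote across all subsets and rounds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def subspace_aggregate(X_tr, y_tr, X_te, n_rounds=20, n_subsets=4, seed=0):
    rng = np.random.default_rng(seed)
    p = X_tr.shape[1]
    votes = []
    for _ in range(n_rounds):
        perm = rng.permutation(p)                      # permute the variables
        for block in np.array_split(perm, n_subsets):  # disjoint variable subsets
            clf = DecisionTreeClassifier(random_state=0).fit(X_tr[:, block], y_tr)
            votes.append(clf.predict(X_te[:, block]))
    V = np.vstack(votes).astype(int)
    # Majority vote over all subset classifiers.
    return np.array([np.bincount(V[:, j]).argmax() for j in range(V.shape[1])])

X, y = make_classification(n_samples=600, n_features=40, n_informative=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
print(np.mean(subspace_aggregate(X_tr, y_tr, X_te) == y_te))
```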
{"title":"A subspace aggregating algorithm for accurate classification","authors":"Saeid Amiri, Reza Modarres","doi":"10.1007/s00180-024-01476-3","DOIUrl":"https://doi.org/10.1007/s00180-024-01476-3","url":null,"abstract":"<p>We present a technique for learning via aggregation in supervised classification. The new method improves classification performance, regardless of which classifier is at its core. This approach exploits the information hidden in subspaces by combinations of aggregating variables and is applicable to high-dimensional data sets. We provide algorithms that randomly divide the variables into smaller subsets and permute them before applying a classification method to each subset. We combine the resulting classes to predict the class membership. Theoretical and simulation analyses consistently demonstrate the high accuracy of our classification methods. In comparison to aggregating observations through sampling, our approach proves to be significantly more effective. Through extensive simulations, we evaluate the accuracy of various classification methods. To further illustrate the effectiveness of our techniques, we apply them to five real-world data sets.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"12 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140075811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Imbalanced data sampling design based on grid boundary domain for big data
Pub Date: 2024-03-08 | DOI: 10.1007/s00180-024-01471-8
Data distributions are often associated with a priori known probabilities, and events of interest occur with small probability, so large amounts of imbalanced data arise in sociology, economics, engineering, and many other fields. Existing over- and under-sampling methods are widely used for imbalanced-data classification problems, but over-sampling can lead to overfitting, while under-sampling discards useful information. We propose a new sampling design algorithm, the neighbor grid of boundary mixed-sampling (NGBM), which focuses on boundary information. The method obtains classification boundary information through grid boundary domain identification, thereby determining the importance of the samples. On this basis, the synthetic minority oversampling technique is applied to the boundary grid, and random under-sampling is applied to the other grids. This mixed sampling strategy extracts more of the important classification-boundary information, especially for identifying positive samples. Numerical simulations and real data analyses are used to discuss the parameter-setting strategy of the NGBM and to illustrate its advantages on imbalanced data, as well as in practical applications.
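A much-simplified sketch of the grid-boundary mixed-sampling idea follows; the exact NGBM construction (neighbor grids, parameter rules) is given in the paper, so everything below, including the SMOTE-like interpolation step and all rates, is an illustrative assumption.

```python
# Simplified grid-boundary mixed sampling: cells containing both classes are
# treated as boundary cells; minority points there are oversampled by
# interpolating between minority points in the same cell (a SMOTE-like step),
# while majority points in pure cells are randomly under-sampled.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

def grid_boundary_mixed_sampling(X, y, n_bins=10, under_rate=0.3, n_synthetic=2):
    """X: (n, p) features scaled to [0, 1]; y: (n,) integer labels with 1 = minority."""
    cells = np.floor(np.clip(X, 0, 1 - 1e-9) * n_bins).astype(int)
    groups = defaultdict(list)
    for i, c in enumerate(map(tuple, cells)):
        groups[c].append(i)
    keep, synthetic = [], []
    for c, idx in groups.items():
        idx = np.array(idx)
        labels = y[idx]
        if labels.min() != labels.max():           # boundary cell: both classes present
            keep.extend(idx)                       # keep everything in boundary cells
            minority = idx[labels == 1]
            if len(minority) >= 2:                 # SMOTE-like interpolation
                for _ in range(n_synthetic * len(minority)):
                    a, b = rng.choice(minority, 2, replace=False)
                    synthetic.append(X[a] + rng.random() * (X[b] - X[a]))
        elif labels[0] == 0:                       # pure majority cell: under-sample
            m = max(1, int(under_rate * len(idx)))
            keep.extend(rng.choice(idx, m, replace=False))
        else:                                      # pure minority cell: keep all
            keep.extend(idx)
    X_new = np.vstack([X[keep]] + ([np.array(synthetic)] if synthetic else []))
    y_new = np.concatenate([y[keep], np.ones(len(synthetic), dtype=int)])
    return X_new, y_new
```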
{"title":"Imbalanced data sampling design based on grid boundary domain for big data","authors":"","doi":"10.1007/s00180-024-01471-8","DOIUrl":"https://doi.org/10.1007/s00180-024-01471-8","url":null,"abstract":"<h3>Abstract</h3> <p>The data distribution is often associated with a <em>priori</em>-known probability, and the occurrence probability of interest events is small, so a large amount of imbalanced data appears in sociology, economics, engineering, and various other fields. The existing over- and under-sampling methods are widely used in imbalanced data classification problems, but over-sampling leads to overfitting, and under-sampling ignores the effective information. We propose a new sampling design algorithm called the neighbor grid of boundary mixed-sampling (NGBM), which focuses on the boundary information. This paper obtains the classification boundary information through grid boundary domain identification, thereby determining the importance of the samples. Based on this premise, the synthetic minority oversampling technique is applied to the boundary grid, and random under-sampling is applied to the other grids. With the help of this mixed sampling strategy, more important classification boundary information, especially for positive sample information identification is extracted. Numerical simulations and real data analysis are used to discuss the parameter-setting strategy of the NGBM and illustrate the advantages of the proposed NGBM in the imbalanced data, as well as practical applications.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"54 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-03-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140075873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sparse estimation of linear model via Bayesian method
Pub Date: 2024-03-04 | DOI: 10.1007/s00180-024-01474-5
This paper considers the sparse estimation problem for regression coefficients in the linear model. Because global-local shrinkage priors do not allow regression coefficients to be estimated as exactly zero, we propose three threshold rules, compare their contraction properties, and pair these rules with the popular horseshoe and horseshoe+ priors, which are usually regarded as global-local shrinkage priors. We obtain hierarchical prior representations for the horseshoe and horseshoe+ priors and give the full conditional posterior distributions of all parameters needed for algorithm implementation. Simulation studies indicate that the horseshoe and horseshoe+ priors combined with the threshold rules are both superior to spike-and-slab models. Finally, a real data analysis demonstrates the effectiveness of the proposed method for variable selection.
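For concreteness, here is a sketch of one standard auxiliary-variable Gibbs sampler for the horseshoe prior (in the style of Makalic and Schmidt), followed by a simple credible-interval thresholding of the posterior draws; the paper's three threshold rules, its hierarchical representations, and the horseshoe+ case are not reproduced, so all formulas and settings below are illustrative assumptions.

```python
# Horseshoe Gibbs sampler with inverse-gamma auxiliary variables, plus a simple
# (illustrative) threshold rule applied to the posterior draws of beta.
import numpy as np

def inv_gamma(shape, scale, rng):
    # InverseGamma(shape, scale) draw(s); scale may be an array for elementwise draws.
    return np.asarray(scale) / rng.gamma(shape, size=np.shape(scale))

def horseshoe_gibbs(X, y, n_iter=2000, burn=500, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    sigma2, tau2, xi = 1.0, 1.0, 1.0
    lam2, nu = np.ones(p), np.ones(p)
    draws = []
    for it in range(n_iter):
        A_inv = np.linalg.inv(XtX + np.diag(1.0 / (tau2 * lam2)))
        A_inv = (A_inv + A_inv.T) / 2.0                 # symmetrise for sampling
        beta = rng.multivariate_normal(A_inv @ Xty, sigma2 * A_inv)
        resid = y - X @ beta
        sigma2 = inv_gamma((n + p) / 2.0,
                           (resid @ resid + np.sum(beta**2 / (tau2 * lam2))) / 2.0, rng)
        lam2 = inv_gamma(1.0, 1.0 / nu + beta**2 / (2.0 * tau2 * sigma2), rng)
        nu = inv_gamma(1.0, 1.0 + 1.0 / lam2, rng)
        tau2 = inv_gamma((p + 1) / 2.0,
                         1.0 / xi + np.sum(beta**2 / lam2) / (2.0 * sigma2), rng)
        xi = inv_gamma(1.0, 1.0 + 1.0 / tau2, rng)
        if it >= burn:
            draws.append(beta.copy())
    return np.array(draws)

rng = np.random.default_rng(4)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta_true = np.r_[3.0, -2.0, 1.5, np.zeros(p - 3)]
y = X @ beta_true + rng.normal(size=n)

draws = horseshoe_gibbs(X, y)
lo, hi = np.percentile(draws, [2.5, 97.5], axis=0)
beta_hat = np.where((lo < 0) & (hi > 0), 0.0, draws.mean(axis=0))  # illustrative rule
print(beta_hat.round(2))
```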
{"title":"Sparse estimation of linear model via Bayesian method $$^*$$","authors":"","doi":"10.1007/s00180-024-01474-5","DOIUrl":"https://doi.org/10.1007/s00180-024-01474-5","url":null,"abstract":"<h3>Abstract</h3> <p>This paper considers the sparse estimation problem of regression coefficients in the linear model. Note that the global–local shrinkage priors do not allow the regression coefficients to be truly estimated as zero, we propose three threshold rules and compare their contraction properties, and also tandem those rules with the popular horseshoe prior and the horseshoe+ prior that are normally regarded as global–local shrinkage priors. The hierarchical prior expressions for the horseshoe prior and the horseshoe+ prior are obtained, and the full conditional posterior distributions for all parameters for algorithm implementation are also given. Simulation studies indicate that the horseshoe/horseshoe+ prior with the threshold rules are both superior to the spike-slab models. Finally, a real data analysis demonstrates the effectiveness of variable selection of the proposed method.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"35 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140036222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Degree selection methods for curve estimation via Bernstein polynomials
Pub Date: 2024-03-02 | DOI: 10.1007/s00180-024-01473-6
Bernstein polynomial (BP) bases can uniformly approximate any continuous function from observed noisy samples. A persistent challenge, however, is the data-driven selection of a suitable degree for the BPs. In the absence of noise, asymptotic theory suggests that a larger degree leads to better approximation. In the presence of noise, however, a larger degree, while reducing bias, also results in larger variance because of the high-dimensional parameter estimation involved, so a balance in the classic bias-variance trade-off is essential. The main objective of this work is to determine the smallest possible degree of the approximating BPs using probabilistic methods that are robust to various shapes of the unknown continuous function. Beyond offering theoretical guidance, the paper includes numerical illustrations that address the choice of a suitable degree for BPs approximating arbitrary continuous functions.
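A hedged sketch of the basic ingredients follows: least-squares fitting on a Bernstein basis of a given degree, with the degree chosen by cross-validation as a simple stand-in criterion. The paper's probabilistic, shape-robust selection methods are not reproduced, and x is assumed to lie in [0, 1].

```python
# Bernstein-basis regression of degree m and a cross-validated degree choice
# (illustrative criterion only); B_{k,m}(x) = C(m,k) x^k (1-x)^(m-k), x in [0,1].
import numpy as np
from scipy.special import comb

def bernstein_design(x, m):
    k = np.arange(m + 1)
    return comb(m, k) * x[:, None] ** k * (1 - x[:, None]) ** (m - k)

def fit_bp(x, y, m):
    B = bernstein_design(x, m)
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)
    return coef

def select_degree(x, y, degrees, n_folds=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, n_folds, size=len(x))
    cv = []
    for m in degrees:
        err = 0.0
        for f in range(n_folds):
            tr, te = folds != f, folds == f
            coef = fit_bp(x[tr], y[tr], m)
            err += np.sum((y[te] - bernstein_design(x[te], m) @ coef) ** 2)
        cv.append(err / len(x))
    return degrees[int(np.argmin(cv))], cv

rng = np.random.default_rng(5)
x = np.sort(rng.random(150))
y = np.sin(3 * np.pi * x) + rng.normal(0, 0.3, 150)
best, _ = select_degree(x, y, degrees=np.arange(2, 31))
print(best)
```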
{"title":"Degree selection methods for curve estimation via Bernstein polynomials","authors":"","doi":"10.1007/s00180-024-01473-6","DOIUrl":"https://doi.org/10.1007/s00180-024-01473-6","url":null,"abstract":"<h3>Abstract</h3> <p>Bernstein Polynomial (BP) bases can uniformly approximate any continuous function based on observed noisy samples. However, a persistent challenge is the data-driven selection of a suitable degree for the BPs. In the absence of noise, asymptotic theory suggests that a larger degree leads to better approximation. However, in the presence of noise, which reduces bias, a larger degree also results in larger variances due to high-dimensional parameter estimation. Thus, a balance in the classic bias-variance trade-off is essential. The main objective of this work is to determine the minimum possible degree of the approximating BPs using probabilistic methods that are robust to various shapes of an unknown continuous function. Beyond offering theoretical guidance, the paper includes numerical illustrations to address the issue of determining a suitable degree for BPs in approximating arbitrary continuous functions.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"22 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140016810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic piecewise linear regression
Pub Date: 2024-03-01 | DOI: 10.1007/s00180-024-01475-4
Mathias von Ottenbreit, Riccardo De Bin
Regression modelling often presents a trade-off between predictiveness and interpretability. Highly predictive and popular tree-based algorithms such as random forests and boosted trees predict the outcome of new observations very well, but the effect of the predictors on the result is hard to interpret. Highly interpretable algorithms such as linear effect-based boosting and MARS, on the other hand, are typically less predictive. Here we propose a novel regression algorithm, automatic piecewise linear regression (APLR), that combines the predictiveness of a boosting algorithm with the interpretability of a MARS model. As a boosting algorithm it automatically handles variable selection, and as a MARS-based approach it accounts for non-linear relationships and possible interaction terms. On simulated and real data examples we show that APLR's predictive performance is comparable to that of the top-performing approaches, while offering an easy way to interpret the results. APLR has been implemented in C++ and wrapped in a Python package as a scikit-learn compatible estimator.
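Since the abstract mentions a scikit-learn-compatible Python wrapper, a minimal usage sketch is given below, assuming the package is distributed as "aplr" and exposes an APLRRegressor class; the import path, constructor arguments, and data are assumptions, so consult the package documentation for the exact API.

```python
# Minimal usage sketch of a scikit-learn-compatible APLR estimator (package and
# class names assumed here, not confirmed by the abstract).
import numpy as np
from sklearn.model_selection import train_test_split
from aplr import APLRRegressor  # assumed import path

rng = np.random.default_rng(6)
X = rng.random((500, 5))
y = 3 * X[:, 0] + 4 * np.maximum(X[:, 1] - 0.5, 0) + rng.normal(0, 0.1, 500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = APLRRegressor()                # default settings; tuning parameters assumed
model.fit(X_tr, y_tr)
print(np.mean((model.predict(X_te) - y_te) ** 2))
```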
{"title":"Automatic piecewise linear regression","authors":"Mathias von Ottenbreit, Riccardo De Bin","doi":"10.1007/s00180-024-01475-4","DOIUrl":"https://doi.org/10.1007/s00180-024-01475-4","url":null,"abstract":"<p>Regression modelling often presents a trade-off between predictiveness and interpretability. Highly predictive and popular tree-based algorithms such as Random Forest and boosted trees predict very well the outcome of new observations, but the effect of the predictors on the result is hard to interpret. Highly interpretable algorithms like linear effect-based boosting and MARS, on the other hand, are typically less predictive. Here we propose a novel regression algorithm, automatic piecewise linear regression (APLR), that combines the predictiveness of a boosting algorithm with the interpretability of a MARS model. In addition, as a boosting algorithm, it automatically handles variable selection, and, as a MARS-based approach, it takes into account non-linear relationships and possible interaction terms. We show on simulated and real data examples how APLR’s performance is comparable to that of the top-performing approaches in terms of prediction, while offering an easy way to interpret the results. APLR has been implemented in C++ and wrapped in a Python package as a Scikit-learn compatible estimator.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"1 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140016808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}