A Bayesian approach for clustering and exact finite-sample model selection in longitudinal data mixtures
M. Corneli, E. Erosheva, X. Qian, M. Lorenzi
Pub Date: 2024-05-08, DOI: 10.1007/s00180-024-01501-5
We consider mixtures of longitudinal trajectories, where one trajectory contains measurements over time of the variable of interest for one individual, and each individual belongs to one cluster. The number of clusters as well as individual cluster memberships are unknown and must be inferred. We propose an original Bayesian clustering framework that allows us to obtain an exact finite-sample model selection criterion for selecting the number of clusters. Our finite-sample approach is more flexible and parsimonious than asymptotic alternatives, such as the Bayesian information criterion or the integrated classification likelihood criterion, in the choice of the number of clusters. Moreover, our approach has other desirable qualities: (i) it keeps the computational effort of the clustering algorithm under control and (ii) it generalizes to several families of regression mixture models, from linear to purely non-parametric. We test our method on simulated datasets as well as on a real-world dataset from the Alzheimer's Disease Neuroimaging Initiative database.
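The exact finite-sample criterion is tied to the paper's Bayesian construction, but the underlying task — clustering trajectories around cluster-level regression curves — can be sketched with a standard EM algorithm for a mixture of linear regressions. The following Python sketch is a generic stand-in, not the authors' method; it assumes a common time grid, linear cluster means, and i.i.d. Gaussian noise, and all names are ours.

```python
import numpy as np

def em_linear_trajectory_mixture(t, Y, K, n_iter=100, seed=0):
    """EM for a K-component mixture of linear trajectories.

    t : (T,) shared observation times; Y : (n, T), one trajectory per row.
    Cluster k has intercept/slope beta[k] and noise variance sigma2[k].
    """
    rng = np.random.default_rng(seed)
    n, T = Y.shape
    X = np.column_stack([np.ones_like(t), t])       # (T, 2) design matrix
    pi = np.full(K, 1.0 / K)
    beta = rng.normal(size=(K, 2))
    sigma2 = np.full(K, Y.var())
    for _ in range(n_iter):
        # E-step: responsibilities from Gaussian log-densities
        logr = np.empty((n, K))
        for k in range(K):
            resid = Y - X @ beta[k]                 # broadcasts over rows
            logr[:, k] = (np.log(pi[k])
                          - 0.5 * T * np.log(2 * np.pi * sigma2[k])
                          - 0.5 * (resid ** 2).sum(axis=1) / sigma2[k])
        logr -= logr.max(axis=1, keepdims=True)     # stabilize the exp
        r = np.exp(logr)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: shared design, so WLS reduces to a weighted mean trajectory
        for k in range(K):
            w = r[:, k]
            pi[k] = w.mean()
            ybar = (w @ Y) / w.sum()
            beta[k] = np.linalg.solve(X.T @ X, X.T @ ybar)
            resid = Y - X @ beta[k]
            sigma2[k] = (w @ (resid ** 2).sum(axis=1)) / (w.sum() * T)
    return pi, beta, sigma2, r
```

In this simplified setting, the number of clusters would typically be chosen by an asymptotic criterion such as BIC — precisely the step the paper replaces with an exact finite-sample criterion.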
{"title":"A Bayesian approach for clustering and exact finite-sample model selection in longitudinal data mixtures","authors":"M. Corneli, E. Erosheva, X. Qian, M. Lorenzi","doi":"10.1007/s00180-024-01501-5","DOIUrl":"https://doi.org/10.1007/s00180-024-01501-5","url":null,"abstract":"<p>We consider mixtures of longitudinal trajectories, where one trajectory contains measurements over time of the variable of interest for one individual and each individual belongs to one cluster. The number of clusters as well as individual cluster memberships are unknown and must be inferred. We propose an original Bayesian clustering framework that allows us to obtain an exact finite-sample model selection criterion for selecting the number of clusters. Our finite-sample approach is more flexible and parsimonious than asymptotic alternatives such as Bayesian information criterion or integrated classification likelihood criterion in the choice of the number of clusters. Moreover, our approach has other desirable qualities: (i) it keeps the computational effort of the clustering algorithm under control and (ii) it generalizes to several families of regression mixture models, from linear to purely non-parametric. We test our method on simulated datasets as well as on a real world dataset from the Alzheimer’s disease neuroimaging initative database.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"38 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140935153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mixture models for simultaneous classification and reduction of three-way data
Roberto Rocci, Maurizio Vichi, Monia Ranalli
Pub Date: 2024-05-06, DOI: 10.1007/s00180-024-01478-1
Finite mixtures of Gaussians are often used to classify two-way (units and variables) or three-way (units, variables and occasions) data. However, two issues arise: model complexity and capturing the true cluster structure. Indeed, a large number of variables and/or occasions implies a large number of model parameters, while the existence of noise variables (and/or occasions) can mask the true cluster structure. The approach adopted in the present paper reduces the number of model parameters by identifying a subspace containing the information needed to classify the observations. This should also help in identifying noise variables and/or occasions. The maximum likelihood model estimation is carried out through an EM-like algorithm. The effectiveness of the proposal is assessed through a simulation study and an application to real data.
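For contrast, a naive two-step baseline — reduce first, then cluster — can be written in a few lines with scikit-learn; the paper instead estimates the subspace and the clusters jointly within one likelihood, so the sketch below (names ours) illustrates what the proposal improves upon rather than the proposal itself.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def project_then_cluster(X, n_clusters, n_components):
    """Two-step baseline: unsupervised PCA projection, then a Gaussian
    mixture in the reduced space. Because the projection ignores the
    clusters, noise variables can still mask the true structure."""
    Z = PCA(n_components=n_components).fit_transform(X)
    gmm = GaussianMixture(n_components=n_clusters, n_init=5).fit(Z)
    return gmm.predict(Z), Z
```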
{"title":"Mixture models for simultaneous classification and reduction of three-way data","authors":"Roberto Rocci, Maurizio Vichi, Monia Ranalli","doi":"10.1007/s00180-024-01478-1","DOIUrl":"https://doi.org/10.1007/s00180-024-01478-1","url":null,"abstract":"<p>Finite mixture of Gaussians are often used to classify two- (units and variables) or three- (units, variables and occasions) way data. However, two issues arise: model complexity and capturing the true cluster structure. Indeed, a large number of variables and/or occasions implies a large number of model parameters; while the existence of noise variables (and/or occasions) could mask the true cluster structure. The approach adopted in the present paper is to reduce the number of model parameters by identifying a sub-space containing the information needed to classify the observations. This should also help in identifying noise variables and/or occasions. The maximum likelihood model estimation is carried out through an EM-like algorithm. The effectiveness of the proposal is assessed through a simulation study and an application to real data.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"62 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140885015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robust Bayesian cumulative probit linear mixed models for longitudinal ordinal data
Kuo-Jung Lee, Ray-Bing Chen, Keunbaik Lee
Pub Date: 2024-05-04, DOI: 10.1007/s00180-024-01499-w
Longitudinal studies have been conducted in various fields, including medicine, economics and the social sciences. In this paper, we focus on longitudinal ordinal data. Since longitudinal data are collected over time, repeated outcomes within each subject may be serially correlated. To address both the within-subject serial correlation and the subject-specific variance, we propose a Bayesian cumulative probit random effects model for the analysis of longitudinal ordinal data. The hypersphere decomposition approach is employed to overcome the positive-definiteness constraint and high dimensionality of the correlation matrix. Additionally, we present a hybrid Gibbs/Metropolis-Hastings algorithm to efficiently generate cutoff points from truncated normal distributions, thereby expediting the convergence of the Markov chain Monte Carlo (MCMC) algorithm. The performance and robustness of our proposed methodology under misspecified correlation matrices are demonstrated through simulation studies under complete data, missing completely at random (MCAR), and missing at random (MAR). We apply the proposed approach to analyze two real ordinal datasets: an arthritis dataset and a lung cancer dataset. To facilitate the implementation of our method, we have developed BayesRGMM, an open-source R package available on CRAN, accompanied by comprehensive documentation and source code accessible at https://github.com/kuojunglee/BayesRGMM/.
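As orientation for readers unfamiliar with cumulative probit models, the sketch below writes down the bare-bones fixed-effects likelihood in Python (names ours); the paper's random effects, hypersphere-decomposed correlation matrix, and MCMC machinery are deliberately omitted, and the first cutpoint is fixed at zero for identifiability.

```python
import numpy as np
from scipy.stats import norm

def cumprobit_negloglik(params, X, y, n_cat):
    """Negative log-likelihood of a cumulative probit model.

    params = (beta, log-gaps); the n_cat - 1 cutpoints are built as
    0 followed by cumulative exp() increments, keeping them increasing.
    y takes integer values 0, ..., n_cat - 1.
    """
    p = X.shape[1]
    beta = params[:p]
    cuts = np.concatenate([[0.0], np.cumsum(np.exp(params[p:]))])
    eta = X @ beta
    upper = np.where(y == n_cat - 1, np.inf, cuts[np.minimum(y, n_cat - 2)])
    lower = np.where(y == 0, -np.inf, cuts[np.maximum(y - 1, 0)])
    prob = norm.cdf(upper - eta) - norm.cdf(lower - eta)
    return -np.log(np.clip(prob, 1e-300, None)).sum()

# Fit, e.g., with scipy.optimize.minimize:
#   minimize(cumprobit_negloglik, np.zeros(X.shape[1] + n_cat - 2),
#            args=(X, y, n_cat), method="BFGS")
```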
{"title":"Robust Bayesian cumulative probit linear mixed models for longitudinal ordinal data","authors":"Kuo-Jung Lee, Ray-Bing Chen, Keunbaik Lee","doi":"10.1007/s00180-024-01499-w","DOIUrl":"https://doi.org/10.1007/s00180-024-01499-w","url":null,"abstract":"<p>Longitudinal studies have been conducted in various fields, including medicine, economics and the social sciences. In this paper, we focus on longitudinal ordinal data. Since the longitudinal data are collected over time, repeated outcomes within each subject may be serially correlated. To address both the within-subjects serial correlation and the specific variance between subjects, we propose a Bayesian cumulative probit random effects model for the analysis of longitudinal ordinal data. The hypersphere decomposition approach is employed to overcome the positive definiteness constraint and high-dimensionality of the correlation matrix. Additionally, we present a hybrid Gibbs/Metropolis-Hastings algorithm to efficiently generate cutoff points from truncated normal distributions, thereby expediting the convergence of the Markov Chain Monte Carlo (MCMC) algorithm. The performance and robustness of our proposed methodology under misspecified correlation matrices are demonstrated through simulation studies under complete data, missing completely at random (MCAR), and missing at random (MAR). We apply the proposed approach to analyze two sets of actual ordinal data: the arthritis dataset and the lung cancer dataset. To facilitate the implementation of our method, we have developed <span>BayesRGMM</span>, an open-source R package available on CRAN, accompanied by comprehensive documentation and source code accessible at https://github.com/kuojunglee/BayesRGMM/.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"62 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140885010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R-estimation in linear models: algorithms, complexity, challenges
Jaromír Antoch, Michal Černý, Ryozo Miura
Pub Date: 2024-05-03, DOI: 10.1007/s00180-024-01495-0
The main objective of this paper is to discuss selected computational aspects of robust estimation in the linear model, with an emphasis on R-estimators. We focus on numerical algorithms and computational efficiency rather than on statistical properties. In addition, we formulate some algorithmic properties that a "good" method for computing R-estimators should satisfy and show how to meet them using the currently available algorithms. We illustrate both good and bad properties of the existing algorithms and propose two-stage methods to minimize the effect of the bad ones. Finally, we motivate a challenge for new approaches based on interior-point methods in optimization.
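To make the computational issues concrete: with Wilcoxon scores, the R-estimator of the slopes minimizes Jaeckel's rank dispersion, a convex but piecewise-linear (hence non-smooth) function of the coefficients. The brute-force Python sketch below (names ours) uses a derivative-free optimizer, which works for small problems but scales poorly — exactly the kind of behavior that motivates the interior-point reformulations the paper points to.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import rankdata

def jaeckel_dispersion(beta, X, y):
    """Jaeckel's dispersion D(beta) = sum_i a(R(e_i)) e_i with Wilcoxon
    scores a(k) = sqrt(12) (k / (n + 1) - 1/2), where R(e_i) is the rank
    of the i-th residual e_i = y_i - x_i' beta."""
    e = y - X @ beta
    n = len(e)
    a = np.sqrt(12.0) * (rankdata(e) / (n + 1) - 0.5)
    return np.sum(a * e)

def r_estimate(X, y):
    """R-estimate of the slopes; X should not contain an intercept column,
    since D is invariant to location shifts. The intercept is recovered
    afterwards as the median of the residuals."""
    beta0 = np.linalg.lstsq(X, y, rcond=None)[0]    # least-squares start
    res = minimize(jaeckel_dispersion, beta0, args=(X, y),
                   method="Nelder-Mead")
    return np.median(y - X @ res.x), res.x
```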
{"title":"R-estimation in linear models: algorithms, complexity, challenges","authors":"Jaromír Antoch, Michal Černý, Ryozo Miura","doi":"10.1007/s00180-024-01495-0","DOIUrl":"https://doi.org/10.1007/s00180-024-01495-0","url":null,"abstract":"<p>The main objective of this paper is to discuss selected computational aspects of robust estimation in the linear model with the emphasis on <i>R</i>-estimators. We focus on numerical algorithms and computational efficiency rather than on statistical properties. In addition, we formulate some algorithmic properties that a “good” method for <i>R</i>-estimators is expected to satisfy and show how to satisfy them using the currently available algorithms. We illustrate both good and bad properties of the existing algorithms. We propose two-stage methods to minimize the effect of the bad properties. Finally we justify a challenge for new approaches based on interior-point methods in optimization.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"176 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140889694","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Two-stage regression spline modeling based on local polynomial kernel regression
Hamid Mraoui, Ahmed El-Alaoui, Souad Bechrouri, Nezha Mohaoui, Abdelilah Monir
Pub Date: 2024-05-01, DOI: 10.1007/s00180-024-01498-x
This paper introduces a new nonparametric regression estimator based on a local quasi-interpolation spline method. The model combines a B-spline basis with simple local polynomial regression, via a blossoming approach, to produce a reduced-rank spline-like smoother. Different coefficient functionals are allowed to have different smoothing parameters (bandwidths) when the function has varying smoothness. In addition, the number and location of the knots of this estimator are not fixed. In practice, one may employ a modest number of basis functions and then choose the smoothing parameter as the minimizer of a selection criterion. In simulations, the approach is highly competitive with P-spline and smoothing-spline methods. Simulated data and a real-data example illustrate the effectiveness of the proposed method.
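The authors' estimator rests on blossoming-based quasi-interpolation, which is not reproduced here; as a point of reference, the sketch below implements the generic penalized B-spline (P-spline) smoother that the paper uses as a comparison method, with the smoothing parameter chosen by generalized cross-validation (requires SciPy >= 1.8 for BSpline.design_matrix; names ours).

```python
import numpy as np
from scipy.interpolate import BSpline

def pspline_fit(x, y, n_knots=20, degree=3,
                lambdas=np.logspace(-4, 4, 30)):
    """P-spline: B-spline basis on equally spaced knots plus a
    second-order difference penalty; lambda picked by GCV."""
    order = np.argsort(x)
    x, y = np.asarray(x)[order], np.asarray(y)[order]
    inner = np.linspace(x[0], x[-1], n_knots)
    knots = np.concatenate([[x[0]] * degree, inner, [x[-1]] * degree])
    B = BSpline.design_matrix(x, knots, degree).toarray()
    nbasis = B.shape[1]
    D = np.diff(np.eye(nbasis), n=2, axis=0)        # difference penalty
    best = None
    for lam in lambdas:
        A = B.T @ B + lam * D.T @ D
        coef = np.linalg.solve(A, B.T @ y)
        edf = np.trace(np.linalg.solve(A, B.T @ B)) # effective df
        rss = np.sum((y - B @ coef) ** 2)
        gcv = len(y) * rss / (len(y) - edf) ** 2
        if best is None or gcv < best[0]:
            best = (gcv, lam, coef)
    _, lam, coef = best
    return BSpline(knots, coef, degree), lam
```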
{"title":"Two-stage regression spline modeling based on local polynomial kernel regression","authors":"Hamid Mraoui, Ahmed El-Alaoui, Souad Bechrouri, Nezha Mohaoui, Abdelilah Monir","doi":"10.1007/s00180-024-01498-x","DOIUrl":"https://doi.org/10.1007/s00180-024-01498-x","url":null,"abstract":"<p>This paper introduces a new nonparametric estimator of the regression based on local quasi-interpolation spline method. This model combines a B-spline basis with a simple local polynomial regression, via blossoming approach, to produce a reduced rank spline like smoother. Different coefficients functionals are allowed to have different smoothing parameters (bandwidths) if the function has different smoothness. In addition, the number and location of the knots of this estimator are not fixed. In practice, we may employ a modest number of basis functions and then determine the smoothing parameter as the minimizer of the criterion. In simulations, the approach achieves very competitive performance with P-spline and smoothing spline methods. Simulated data and a real data example are used to illustrate the effectiveness of the method proposed in this paper.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"17 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140827212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Advancements in reliability estimation for the exponentiated Pareto distribution: a comparison of classical and Bayesian methods with lower record values
Shubham Saini
Pub Date: 2024-04-29, DOI: 10.1007/s00180-024-01497-y
Estimating the reliability of multicomponent systems is crucial in various engineering and reliability analysis applications. This paper investigates multicomponent stress-strength reliability estimation based on lower record values, specifically for the exponentiated Pareto distribution. We compare classical estimation techniques, such as maximum likelihood estimation, with Bayesian estimation methods. Under Bayesian estimation, we employ Markov chain Monte Carlo techniques and Tierney-Kadane's approximation to obtain the posterior distribution of the reliability parameter. To evaluate the performance of the proposed estimation approaches, we conduct a comprehensive simulation study, considering various system configurations and sample sizes. Additionally, we analyze real data to illustrate the practical applicability of our methods. The proposed methodologies provide valuable insights for engineers and reliability analysts in accurately assessing the reliability of multicomponent systems using lower record values.
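For concreteness, the s-out-of-k multicomponent reliability R_{s,k} = P(at least s of k component strengths exceed the common stress) can be approximated by plain Monte Carlo once the distribution can be sampled. The sketch below assumes the common exponentiated Pareto parameterization F(x) = (1 - (1 + x)^(-lam))^theta for x > 0 with a shared lam; it does not reproduce the paper's record-value estimators (names ours).

```python
import numpy as np

def rvs_exp_pareto(theta, lam, size, rng):
    """Inverse-CDF sampling from F(x) = (1 - (1 + x)**(-lam))**theta."""
    u = rng.uniform(size=size)
    return (1.0 - u ** (1.0 / theta)) ** (-1.0 / lam) - 1.0

def reliability_s_of_k(s, k, theta1, theta2, lam, n_sim=200_000, seed=1):
    """Monte Carlo estimate of R_{s,k}: strengths are i.i.d.
    EP(theta1, lam); the common stress is EP(theta2, lam)."""
    rng = np.random.default_rng(seed)
    strengths = rvs_exp_pareto(theta1, lam, (n_sim, k), rng)
    stress = rvs_exp_pareto(theta2, lam, (n_sim, 1), rng)
    return np.mean((strengths > stress).sum(axis=1) >= s)
```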
{"title":"Advancements in reliability estimation for the exponentiated Pareto distribution: a comparison of classical and Bayesian methods with lower record values","authors":"Shubham Saini","doi":"10.1007/s00180-024-01497-y","DOIUrl":"https://doi.org/10.1007/s00180-024-01497-y","url":null,"abstract":"<p>Estimating the reliability of multicomponent systems is crucial in various engineering and reliability analysis applications. This paper investigates the multicomponent stress strength reliability estimation using lower record values, specifically for the exponentiated Pareto distribution. We compare classical estimation techniques, such as maximum likelihood estimation, with Bayesian estimation methods. Under Bayesian estimation, we employ Markov Chain Monte Carlo techniques and Tierney–Kadane’s approximation to obtain the posterior distribution of the reliability parameter. To evaluate the performance of the proposed estimation approaches, we conduct a comprehensive simulation study, considering various system configurations and sample sizes. Additionally, we analyze real data to illustrate the practical applicability of our methods. The proposed methodologies provide valuable insights for engineers and reliability analysts in accurately assessing the reliability of multicomponent systems using lower record values.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"153 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140885012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Maximizing adjusted covariance: new supervised dimension reduction for classification
Hyejoon Park, Hyunjoong Kim, Yung-Seop Lee
Pub Date: 2024-04-02, DOI: 10.1007/s00180-024-01472-7
This study proposes a new linear dimension reduction technique called Maximizing Adjusted Covariance (MAC), which is suitable for supervised classification. The new approach adjusts the covariance matrix between input and target variables using the within-class sum of squares, thereby promoting class separation after linear dimension reduction. MAC has a low computational cost and can complement existing linear dimension reduction techniques for classification. In this study, the classification performance of MAC was compared with that of existing linear dimension reduction methods on 44 datasets. In most of the classification models used in the experiment, MAC showed better classification accuracy and F1 scores than the other linear dimension reduction methods.
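The precise covariance adjustment is defined in the paper; the sketch below records one plausible reading for orientation only — whiten the input/target cross-covariance by the within-class scatter and keep the leading singular directions — and should not be taken as the exact MAC procedure (names ours).

```python
import numpy as np

def mac_like_directions(X, y, n_dir=2, ridge=1e-8):
    """Illustrative supervised reduction: cross-covariance between
    centered inputs and centered one-hot class indicators, adjusted by
    the within-class scatter; returns projection directions (p, n_dir)."""
    X = X - X.mean(axis=0)
    classes = np.unique(y)
    Yind = (y[:, None] == classes[None, :]).astype(float)
    Yind -= Yind.mean(axis=0)
    C = X.T @ Yind / len(y)                         # (p, K) cross-cov
    W = np.zeros((X.shape[1], X.shape[1]))
    for c in classes:
        Xc = X[y == c] - X[y == c].mean(axis=0)
        W += Xc.T @ Xc                              # within-class scatter
    W = W / len(y) + ridge * np.eye(X.shape[1])
    U, _, _ = np.linalg.svd(np.linalg.solve(W, C), full_matrices=False)
    return U[:, :n_dir]                             # project with X @ U
```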
{"title":"Maximizing adjusted covariance: new supervised dimension reduction for classification","authors":"Hyejoon Park, Hyunjoong Kim, Yung-Seop Lee","doi":"10.1007/s00180-024-01472-7","DOIUrl":"https://doi.org/10.1007/s00180-024-01472-7","url":null,"abstract":"<p>This study proposes a new linear dimension reduction technique called Maximizing Adjusted Covariance (MAC), which is suitable for supervised classification. The new approach is to adjust the covariance matrix between input and target variables using the within-class sum of squares, thereby promoting class separation after linear dimension reduction. MAC has a low computational cost and can complement existing linear dimensionality reduction techniques for classification. In this study, the classification performance by MAC was compared with those of the existing linear dimension reduction methods using 44 datasets. In most of the classification models used in the experiment, the MAC dimension reduction method showed better classification accuracy and F1 score than other linear dimension reduction methods.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"53 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140567927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A class of transformed joint quantile time series models with applications to health studies
Fahimeh Tourani-Farani, Zeynab Aghabazaz, Iraj Kazemi
Pub Date: 2024-04-01, DOI: 10.1007/s00180-024-01484-3

Extensions of quantile regression modeling for time series analysis are widely employed in medical and health studies. This study introduces a class of transformed quantile-dispersion regression models for non-stationary time series. These models are flexible enough to incorporate a time-varying structure into the model specification, enabling precise predictions for future decisions. The proposed modeling methodology applies to dynamic processes characterized by high variation and possible periodicity, relying on a non-linear framework. Additionally, unlike transformed time series models, our approach interprets the regression parameters directly in terms of the original response. For computational purposes, we present an iteratively reweighted least squares algorithm. We assess the performance of the model through simulation experiments and illustrate the modeling strategy by analyzing time series of influenza infections and daily COVID-19 deaths.
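The paper's algorithm handles the transformation and dispersion components jointly; as a self-contained reference point, the sketch below shows the classical IRLS scheme for a single quantile regression, the computational core that such algorithms build on (names ours; the residual floor delta smooths the check loss near zero).

```python
import numpy as np

def quantile_irls(X, y, tau=0.5, n_iter=50, delta=1e-6):
    """Quantile regression by iteratively reweighted least squares:
    minimizes sum_i rho_tau(y_i - x_i' beta), rho_tau(r) = r (tau - 1{r<0}),
    via weighted LS with weights |tau - 1{r<0}| / max(|r|, delta)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]     # LS warm start
    for _ in range(n_iter):
        r = y - X @ beta
        w = np.where(r >= 0, tau, 1.0 - tau) / np.maximum(np.abs(r), delta)
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX, WX.T @ y)
    return beta
```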
{"title":"A class of transformed joint quantile time series models with applications to health studies","authors":"Fahimeh Tourani-Farani, Zeynab Aghabazaz, Iraj Kazemi","doi":"10.1007/s00180-024-01484-3","DOIUrl":"https://doi.org/10.1007/s00180-024-01484-3","url":null,"abstract":"<p>Extensions of quantile regression modeling for time series analysis are extensively employed in medical and health studies. This study introduces a specific class of transformed quantile-dispersion regression models for non-stationary time series. These models possess the flexibility to incorporate the time-varying structure into the model specification, enabling precise predictions for future decisions. Our proposed modeling methodology applies to dynamic processes characterized by high variation and possible periodicity, relying on a non-linear framework. Additionally, unlike the transformed time series model, our approach directly interprets the regression parameters concerning the initial response. For computational purposes, we present an iteratively reweighted least squares algorithm. To assess the performance of our model, we conduct simulation experiments. To illustrate the modeling strategy, we analyze time-series measurements of influenza infection and daily COVID-19 deaths.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"96 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140567967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A smoothed semiparametric likelihood for estimation of nonparametric finite mixture models with a copula-based dependence structure
Michael Levine, Gildas Mazo
Pub Date: 2024-03-27, DOI: 10.1007/s00180-024-01483-4
In this manuscript, we consider a finite multivariate nonparametric mixture model where the dependence between the marginal densities is modeled using the copula device. Pseudo expectation–maximization (EM) stochastic algorithms were recently proposed to estimate all of the components of this model under a location-scale constraint on the marginals. Here, we introduce a deterministic algorithm that seeks to maximize a smoothed semiparametric likelihood. No location-scale assumption is made about the marginals. The algorithm is monotonic in one special case, and, in another, leads to “approximate monotonicity”—whereby the difference between successive values of the objective function becomes non-negative up to an additive term that becomes negligible after a sufficiently large number of iterations. The behavior of this algorithm is illustrated on several simulated and real datasets. The results suggest that, under suitable conditions, the proposed algorithm may indeed be monotonic in general. A discussion of the results and some possible future research directions round out our presentation.
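A univariate skeleton of the smoothed-likelihood idea — alternate responsibilities with weighted kernel density re-estimates of each component — fits in a few lines; the copula-based dependence between coordinates, which is the paper's main contribution, is omitted here, and without further constraints such a one-dimensional mixture is only weakly identified (names ours).

```python
import numpy as np

def smoothed_np_em(x, K=2, bw=None, n_iter=50, seed=0):
    """Nonparametric EM sketch for a 1-D K-component mixture: each
    component density is a responsibility-weighted Gaussian KDE."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    if bw is None:
        bw = 1.06 * x.std() * n ** (-0.2)           # Silverman's rule
    diff = (x[:, None] - x[None, :]) / bw
    Kmat = np.exp(-0.5 * diff ** 2) / (np.sqrt(2 * np.pi) * bw)
    r = rng.dirichlet(np.ones(K), size=n)           # random soft start
    pi = r.mean(axis=0)
    for _ in range(n_iter):
        f = Kmat @ r / r.sum(axis=0)                # weighted KDEs at data
        num = pi * f
        r = num / num.sum(axis=1, keepdims=True)    # responsibilities
        pi = r.mean(axis=0)
    return pi, r
```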
{"title":"A smoothed semiparametric likelihood for estimation of nonparametric finite mixture models with a copula-based dependence structure","authors":"Michael Levine, Gildas Mazo","doi":"10.1007/s00180-024-01483-4","DOIUrl":"https://doi.org/10.1007/s00180-024-01483-4","url":null,"abstract":"<p>In this manuscript, we consider a finite multivariate nonparametric mixture model where the dependence between the marginal densities is modeled using the copula device. Pseudo expectation–maximization (EM) stochastic algorithms were recently proposed to estimate all of the components of this model under a location-scale constraint on the marginals. Here, we introduce a deterministic algorithm that seeks to maximize a smoothed semiparametric likelihood. No location-scale assumption is made about the marginals. The algorithm is monotonic in one special case, and, in another, leads to “approximate monotonicity”—whereby the difference between successive values of the objective function becomes non-negative up to an additive term that becomes negligible after a sufficiently large number of iterations. The behavior of this algorithm is illustrated on several simulated and real datasets. The results suggest that, under suitable conditions, the proposed algorithm may indeed be monotonic in general. A discussion of the results and some possible future research directions round out our presentation.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"28 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140884878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A subspace aggregating algorithm for accurate classification
Saeid Amiri, Reza Modarres
Pub Date: 2024-03-09, DOI: 10.1007/s00180-024-01476-3
We present a technique for learning via aggregation in supervised classification. The new method improves classification performance regardless of which classifier is at its core. The approach exploits information hidden in subspaces by aggregating over combinations of variables and is applicable to high-dimensional data sets. We provide algorithms that randomly divide the variables into smaller subsets and permute them before applying a classification method to each subset, then combine the resulting class predictions to assign class membership. Theoretical and simulation analyses consistently demonstrate the high accuracy of our classification methods. Compared to aggregating observations through sampling, our approach proves significantly more effective. Through extensive simulations, we evaluate the accuracy of various classification methods, and we further illustrate the effectiveness of our techniques on five real-world data sets.
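A minimal sketch of the divide-permute-aggregate recipe, assuming scikit-learn and majority voting as the combiner (the paper studies the scheme more generally; all names below are ours):

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def subspace_vote(X_tr, y_tr, X_te, n_blocks=10, seed=0, base=None):
    """Fit one copy of a base classifier per random disjoint block of
    variables, then majority-vote the test predictions."""
    rng = np.random.default_rng(seed)
    base = base if base is not None else LogisticRegression(max_iter=1000)
    blocks = np.array_split(rng.permutation(X_tr.shape[1]), n_blocks)
    votes = np.array([clone(base).fit(X_tr[:, b], y_tr).predict(X_te[:, b])
                      for b in blocks])             # (n_blocks, n_test)
    preds = np.empty(votes.shape[1], dtype=votes.dtype)
    for j in range(votes.shape[1]):                 # per-test-point vote
        vals, counts = np.unique(votes[:, j], return_counts=True)
        preds[j] = vals[counts.argmax()]
    return preds
```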
{"title":"A subspace aggregating algorithm for accurate classification","authors":"Saeid Amiri, Reza Modarres","doi":"10.1007/s00180-024-01476-3","DOIUrl":"https://doi.org/10.1007/s00180-024-01476-3","url":null,"abstract":"<p>We present a technique for learning via aggregation in supervised classification. The new method improves classification performance, regardless of which classifier is at its core. This approach exploits the information hidden in subspaces by combinations of aggregating variables and is applicable to high-dimensional data sets. We provide algorithms that randomly divide the variables into smaller subsets and permute them before applying a classification method to each subset. We combine the resulting classes to predict the class membership. Theoretical and simulation analyses consistently demonstrate the high accuracy of our classification methods. In comparison to aggregating observations through sampling, our approach proves to be significantly more effective. Through extensive simulations, we evaluate the accuracy of various classification methods. To further illustrate the effectiveness of our techniques, we apply them to five real-world data sets.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"12 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140075811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}