Pub Date: 2024-03-04 | DOI: 10.1007/s00357-024-09466-2
Abstract
We propose a discrete random effects multinomial regression model to address estimation and inference for categorical, hierarchical data. The random effects are assumed to follow a discrete distribution with an a priori unknown number of support points. For a response with K categories, the model identifies a latent structure at the highest level of grouping, where groups are clustered into subpopulations. The model does not assume independence across the random effects associated with different response categories, which improves on the multinomial semi-parametric multilevel model previously proposed in the literature: since the category-specific random effects arise from the same subjects, the independence assumption rarely holds in real data. To evaluate the improvements provided by the proposed model, we reproduce simulation and case studies from the literature, highlighting the strength of the method in properly modelling the real data structure and the advantages of accounting for the data dependence structure.
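To make the model family concrete, here is a minimal sketch of the category probabilities in a K-category multinomial logit with a group-level (discrete random-effect) intercept per non-reference category. The function name, array shapes, and the choice of category 0 as reference are illustrative conventions, not the authors' notation or estimation procedure.

```python
import numpy as np

def multinomial_probs(x, beta, b_group):
    """Category probabilities for a K-category multinomial logit.

    x       : covariate vector, shape (p,)
    beta    : fixed-effect coefficients, shape (p, K-1)
    b_group : discrete random intercept of this unit's group, shape (K-1,)
    Category 0 is the reference, so its linear predictor is fixed at 0.
    """
    eta = x @ beta + b_group                      # shape (K-1,)
    expo = np.exp(np.concatenate([[0.0], eta]))   # prepend reference category
    return expo / expo.sum()
```

With all coefficients and random effects at zero, the three categories are equiprobable, which is a quick sanity check on the softmax normalization.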
Title: Inferential Tools for Assessing Dependence Across Response Categories in Multinomial Models with Discrete Random Effects (Journal of Classification)
Pub Date: 2024-03-04 | DOI: 10.1007/s00357-024-09467-1
Ji Hyun Nam, Jongmin Mun, Seongil Jo, Jaeoh Kim
Since the 1953 truce, the Republic of Korea Army (ROKA) has regularly conducted artillery training, posing a risk of wildfires — a threat to both the environment and the public perception of national defense. To assess this risk and aid decision-making within the ROKA, we built a predictive model of wildfires triggered by artillery training. To this end, we combined the ROKA dataset with a meteorological database. Given the infrequent occurrence of wildfires (imbalance ratio ≈ 1:24 in our dataset), achieving balanced detection of wildfire occurrences and non-occurrences is challenging. Our approach combines a weighted support vector machine with Gaussian mixture-based oversampling, effectively penalizing misclassification of wildfires. Applied to our dataset, our method outperforms traditional algorithms (G-mean = 0.864, sensitivity = 0.956, specificity = 0.781), indicating balanced detection. This study not only helps reduce wildfires during artillery training but also provides a practical wildfire prediction method for similar climates worldwide.
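The balanced-detection metric is easy to reproduce: G-mean is the geometric mean of sensitivity and specificity, and indeed sqrt(0.956 × 0.781) ≈ 0.864. The following sketch shows the general class-weighted SVM idea with scikit-learn on synthetic 1:24 data; the data, kernel, and `class_weight="balanced"` setting are stand-ins for illustration, not the paper's configuration or its Gaussian-mixture oversampling step.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic 1:24 imbalanced sample (a stand-in for the non-public ROKA data):
# 40 "wildfire" points around (2, 2), 960 "no fire" points around (0, 0)
X = np.vstack([rng.normal(2.0, 1.0, (40, 2)), rng.normal(0.0, 1.0, (960, 2))])
y = np.array([1] * 40 + [0] * 960)

# Balanced class weights penalize misclassifying the rare wildfire class more
clf = SVC(kernel="rbf", class_weight="balanced").fit(X, y)
tn, fp, fn, tp = confusion_matrix(y, clf.predict(X)).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
g_mean = (sensitivity * specificity) ** 0.5  # balanced-detection summary
```

Without the weights, an SVM on such data tends to maximize plain accuracy by neglecting the minority class, which is exactly what the G-mean criterion penalizes.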
Title: Prediction of Forest Fire Risk for Artillery Military Training using Weighted Support Vector Machine for Imbalanced Data
Pub Date: 2024-03-04 | DOI: 10.1007/s00357-024-09468-0
Hema Banati, Richa Sharma, Asha Yadav
Binary metaheuristic algorithms prove invaluable for solving binary optimization problems. This paper proposes a binary variant of the peacock algorithm (PA) for feature selection. PA, a recent metaheuristic algorithm, is built upon the lekking and mating behaviors of peacocks and peahens. While designing the binary variant, two major shortcomings of PA (lek formation and offspring generation) were identified and addressed. Eight binary variants of PA are proposed and compared on mean fitness to identify the best variant, called the binary peacock algorithm (bPA). To validate bPA's performance, experiments are conducted on 34 benchmark datasets and the results are compared with eight well-known binary metaheuristic algorithms. The results show that bPA classifies 30 datasets with the highest accuracy and extracts the fewest features in 32 datasets, achieving up to a 99.80% reduction in feature subset size on the dataset with the most features. bPA attained rank 1 in the Friedman rank test over all parameters.
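Binary metaheuristics for feature selection all share the same skeleton: a candidate is a binary mask over features, and fitness trades classification accuracy against subset size. The sketch below shows that skeleton with plain random search standing in for the peacock lekking/mating operators (which are specific to the paper); the classifier, the weight `alpha`, and the dataset are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(1)

def fitness(mask, alpha=0.99):
    # Standard wrapper fitness: reward accuracy, lightly reward small subsets
    if not mask.any():
        return 0.0
    acc = cross_val_score(KNeighborsClassifier(), X[:, mask], y, cv=3).mean()
    return alpha * acc + (1 - alpha) * (1 - mask.mean())

# Random search stands in here for bPA's population-based search operators
best_mask, best_fit = None, -1.0
for _ in range(20):
    mask = rng.random(X.shape[1]) < 0.5   # random binary mask over features
    f = fitness(mask)
    if f > best_fit:
        best_mask, best_fit = mask, f
```

Any binary metaheuristic plugs into this loop by replacing how the next `mask` is generated; the fitness function and evaluation protocol stay the same.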
Title: Binary Peacock Algorithm: A Novel Metaheuristic Approach for Feature Selection
This work addresses the problem of supervised classification for high-dimensional and highly correlated data using correlation blocks and supervised dimension reduction. We propose a method that combines block partitioning based on interval graph modeling with an extension of principal component analysis (PCA) that incorporates conditional class moment estimates in the low-dimensional projection. Block partitioning handles the high correlation of the data by grouping variables into blocks such that correlation within a block is maximized and correlation between variables in different blocks is minimized. The extended PCA performs supervised low-dimensional projection and clustering. Applied to gene expression data from 445 individuals divided into two groups (diseased and non-diseased) and 719,656 single nucleotide polymorphisms (SNPs), the method shows good clustering and prediction performance. SNPs are a type of genetic variation representing a difference in a single deoxyribonucleic acid (DNA) building block, namely a nucleotide. Previous research has shown that SNPs can be used to identify the correct population origin of an individual and can act in isolation or simultaneously to impact a phenotype. In this regard, studying the contribution of genetics to infectious disease phenotypes is crucial. The classical statistical models currently used in genome-wide association studies (GWAS) have shown their limitations in detecting genes of interest for complex diseases such as asthma or malaria. In this study, we first investigate a linkage disequilibrium (LD) block partition method based on interval graph modeling to handle the high correlation between SNPs. Then, we use supervised approaches, in particular the extension of PCA that incorporates conditional class moment estimates in the low-dimensional projection, to identify the determining SNPs in malaria episodes. Experimental results on the Dielmo-Ndiop project dataset show that the linear discriminant analysis (LDA) approach achieves significantly high accuracy in predicting malaria episodes.
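The LD-style block partition can be illustrated with a greedy, correlation-threshold version. This is a simplified stand-in for the paper's interval-graph method, and it assumes the variables are ordered (as SNPs are along the genome), so blocks are contiguous runs; the threshold value is an illustrative parameter.

```python
import numpy as np

def correlation_blocks(X, threshold=0.7):
    """Greedy grouping of consecutive variables into high-correlation blocks.

    A new variable joins the current block only if its absolute correlation
    with every variable already in the block is at least `threshold`;
    otherwise the block is closed and a new one starts.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    blocks, current = [], [0]
    for j in range(1, X.shape[1]):
        if corr[j, current].min() >= threshold:
            current.append(j)
        else:
            blocks.append(current)
            current = [j]
    blocks.append(current)
    return blocks
```

On data with two independent latent signals driving columns 0–2 and 3–4 respectively, the partition recovers exactly those two blocks, which is the behavior the supervised projection step then exploits block by block.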
Title: Supervised Classification of High-Dimensional Correlated Data: Application to Genomic Data
Authors: Aboubacry Gaye, Abdou Ka Diongue, Seydou Nourou Sylla, Maryam Diarra, Amadou Diallo, Cheikh Talla, Cheikh Loucoubar
Pub Date: 2024-02-28 | DOI: 10.1007/s00357-024-09463-5
Pub Date: 2024-01-25 | DOI: 10.1007/s00357-024-09462-6
Keding Chen, Yong Peng, Feiping Nie, Wanzeng Kong
Feature selection and subspace learning are two primary methods for achieving dimensionality reduction and enhancing discriminability. However, in unsupervised learning no label information is available to guide the dimensionality reduction process. To this end, we propose a soft label guided unsupervised discriminative sparse subspace feature selection (UDS²FS) model, which offers two advantages over existing studies. On the one hand, UDS²FS seeks a discriminative subspace that simultaneously maximizes the between-class data scatter and minimizes the within-class scatter. On the other hand, UDS²FS estimates the data label information in the learned subspace, which in turn serves as soft labels to guide the discriminative subspace learning process. Moreover, the ℓ2,0-norm is imposed to achieve row sparsity of the subspace projection matrix, which is parameter-free and more stable than the ℓ2,1-norm. The performance of UDS²FS is evaluated from three aspects: a synthetic data set to check its iterative optimization process, several toy data sets to visualize the feature selection effect, and benchmark data sets to examine its clustering performance. The results show that UDS²FS is competitive in joint subspace learning and feature selection compared with related models.
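The link between the ℓ2,0-norm and feature selection is worth making explicit: constraining the projection matrix W to have at most k nonzero rows means each feature is either kept (its row survives) or dropped entirely, so the row ℓ2-norms act as feature scores. A minimal numpy sketch of that selection step follows (the optimization that produces W in the paper is not shown; the function name is illustrative).

```python
import numpy as np

def select_features_l20(W, k):
    """Row-sparse selection under an l2,0 constraint: keep exactly the k rows
    of the projection matrix W with largest l2 norm, zero out the rest."""
    scores = np.linalg.norm(W, axis=1)          # one importance score per feature
    keep = np.sort(np.argsort(scores)[::-1][:k])
    W_sparse = np.zeros_like(W)
    W_sparse[keep] = W[keep]
    return keep, W_sparse
```

Unlike ℓ2,1 regularization, which shrinks rows toward zero and needs a tuned penalty weight, the row count k here is the only knob, which is the parameter-free stability the abstract refers to.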
Title: Soft Label Guided Unsupervised Discriminative Sparse Subspace Feature Selection
Pub Date: 2024-01-23 | DOI: 10.1007/s00357-023-09457-9
Fulvia Pennoni, Francesco Bartolucci, Silvia Pandolfi
We propose a variable selection method for multivariate hidden Markov models with continuous responses that are partially or completely missing at a given time occasion. Through this procedure, we achieve a dimensionality reduction by selecting the subset of the most informative responses for clustering individuals and simultaneously choosing the optimal number of these clusters corresponding to latent states. The approach is based on comparing different model specifications in terms of the subset of responses assumed to be dependent on the latent states, and it relies on a greedy search algorithm based on the Bayesian information criterion seen as an approximation of the Bayes factor. A suitable expectation-maximization algorithm is employed to obtain maximum likelihood estimates of the model parameters under the missing-at-random assumption. The proposal is illustrated via Monte Carlo simulation and an application where development indicators collected over eighteen years are selected, and countries are clustered into groups to evaluate their growth over time.
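The greedy BIC search over response subsets can be sketched generically: repeatedly add the candidate response whose inclusion most lowers the BIC, and stop when no addition helps. In this sketch `bic_of_subset` is a hypothetical callback that would fit the hidden Markov model on a candidate subset and return its BIC; the toy score in the test is purely illustrative.

```python
def greedy_forward_bic(candidates, bic_of_subset):
    """Forward greedy search minimizing BIC (an approximation of the Bayes
    factor comparison between model specifications)."""
    selected = []
    best = bic_of_subset(selected)
    while True:
        # Score every not-yet-selected candidate added to the current subset
        trials = [(bic_of_subset(selected + [c]), c)
                  for c in candidates if c not in selected]
        if not trials:
            break
        b, c = min(trials)
        if b >= best:          # no candidate improves the criterion: stop
            break
        selected.append(c)
        best = b
    return selected, best
```

The same skeleton supports backward or stepwise variants by changing how `trials` is built; the model-specific work all lives inside `bic_of_subset`.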
Title: Variable Selection for Hidden Markov Models with Continuous Variables and Missing Data
Pub Date: 2024-01-11 | DOI: 10.1007/s00357-023-09461-z
Youn Seon Lim
Cognitive diagnosis models provide diagnostic information on whether examinees have mastered the skills, called “attributes,” that characterize a given knowledge domain. Based on attribute mastery, distinct proficiency classes are defined, to which examinees are assigned based on their item responses. Attributes are typically treated as binary; however, polytomous attributes may yield higher precision in assessing examinees’ attribute mastery. Karelitz (2004) introduced the ordered-category attribute coding (OCAC) framework to accommodate polytomous attributes. Other approaches to handling polytomous attributes in cognitive diagnosis have been proposed in the literature, but the heavy parameterization of these models often makes them difficult to fit. In this article, a nonparametric method for cognitive diagnosis with polytomous attributes is proposed, called the nonparametric polytomous attributes diagnostic classification (NPADC) method, which relies on an adaptation of the OCAC framework. The new NPADC method can be used with various cognitive diagnosis models. It does not require large sample sizes, and it is computationally efficient and highly effective, as evidenced by the recovery rates of the proficiency classes observed in large-scale simulation studies. The NPADC method is also applied to a real-world data set.
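The core nonparametric idea is distance-based classification: each candidate attribute profile implies an ideal item-response pattern, and an examinee is assigned to the profile whose ideal pattern is nearest to their observed responses. The binary-attribute sketch below illustrates that idea with Hamming distance; NPADC generalizes it to ordered polytomous attributes, and the matrices here are illustrative.

```python
import numpy as np

def npc_classify(responses, ideal):
    """Assign each examinee to the nearest ideal response pattern.

    responses : (n_examinees, n_items) observed 0/1 item responses
    ideal     : (n_profiles, n_items) ideal responses implied by each
                candidate attribute profile
    Returns the index of the closest profile (Hamming distance) per examinee.
    """
    d = np.abs(responses[:, None, :] - ideal[None, :, :]).sum(axis=2)
    return d.argmin(axis=1)
```

No item parameters are estimated, which is why such methods remain usable at small sample sizes where parametric cognitive diagnosis models struggle.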
Title: Nonparametric Cognitive Diagnosis When Attributes Are Polytomous
Pub Date: 2024-01-08 | DOI: 10.1007/s00357-023-09458-8
Abstract
Normal cluster-weighted models constitute a modern approach to linear regression that simultaneously performs model-based cluster analysis and multivariate linear regression analysis with random quantitative regressors. Robustified models have recently been developed, based on the contaminated normal distribution, which can manage the presence of mildly atypical observations. A more flexible class of contaminated normal linear cluster-weighted models is specified here, in which the researcher is free to use a different vector of regressors for each response. The novel class also includes parsimonious models, where parsimony is attained by imposing suitable constraints on the component-covariance matrices of either the responses or the regressors. Identifiability conditions are illustrated and discussed. An expectation-conditional maximisation algorithm is provided for maximum likelihood estimation of the model parameters. The effectiveness and usefulness of the proposed models are shown through the analysis of simulated and real datasets.
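The contaminated normal distribution underlying these robustified models is simply a two-component mixture sharing one mean: a "good" normal plus a "bad" normal with inflated covariance that absorbs mildly atypical points. A density sketch follows; the values of `alpha` (proportion of good points) and `eta` (inflation factor) are illustrative defaults, whereas the actual models estimate them from data.

```python
import numpy as np
from scipy.stats import multivariate_normal

def contaminated_normal_pdf(x, mu, cov, alpha=0.9, eta=5.0):
    """Density of a contaminated normal: alpha * N(mu, cov)
    + (1 - alpha) * N(mu, eta * cov), with alpha in (0, 1) and eta > 1."""
    good = multivariate_normal.pdf(x, mean=mu, cov=cov)
    bad = multivariate_normal.pdf(x, mean=mu, cov=eta * np.asarray(cov))
    return alpha * good + (1 - alpha) * bad
```

Because the tails are heavier than a plain normal's, a mildly outlying observation gets non-negligible density under the bad component instead of distorting the estimates of `mu` and `cov`.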
Title: Parsimonious Seemingly Unrelated Contaminated Normal Cluster-Weighted Models
Pub Date: 2024-01-06 | DOI: 10.1007/s00357-023-09460-0
Abstract
A family of parsimonious contaminated shifted asymmetric Laplace mixtures is developed for unsupervised classification of asymmetric clusters in the presence of outliers and noise. A series of constraints is applied to a modified factor analyzer structure of the component scale matrices, yielding a family of twelve models. The modified factor analyzer structure and these parsimonious constraints make the models effective for analyzing high-dimensional data by reducing the number of free parameters to be estimated. A variant of the expectation-maximization algorithm is developed for parameter estimation, and convergence issues are discussed and addressed. Popular model selection criteria such as the Bayesian information criterion and the integrated complete likelihood (ICL) are utilized, and a novel modification of the ICL is also considered. Through a series of simulation studies and real data analyses that include comparisons to well-established methods, we demonstrate the improvements in classification performance achieved by the proposed family of models.
Title: Unsupervised Classification with a Family of Parsimonious Contaminated Shifted Asymmetric Laplace Mixtures
Pub Date: 2023-12-07 | DOI: 10.1007/s00357-023-09456-w
Jacopo Di Iorio, Simone Vantini
Nowadays, an increasing number of problems involve data with one infinite continuous dimension, known as functional data. In this paper, we introduce the funLOCI algorithm, which enables the identification of functional local clusters, or functional loci, i.e., subsets or groups of curves that exhibit similar behavior across the same continuous subset of the domain. The definition of functional local clusters incorporates ideas from multivariate and functional clustering and biclustering and is based on an additive model that takes into account the shape of the curves. funLOCI is a multi-step algorithm that relies on hierarchical clustering and a functional version of the mean squared residue score to identify and validate candidate loci. Subsequently, all results are collected and ordered in a post-processing step. To evaluate the algorithm's performance, we conduct extensive simulations and compare it with other recently proposed algorithms in the literature. Furthermore, we apply funLOCI to a real-data case regarding inner carotid arteries.
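The mean squared residue score that funLOCI adapts comes from classical (Cheng–Church style) biclustering: for a submatrix whose rows are curves and columns are grid points of a domain subset, it is zero exactly when the curves differ by additive shifts and grows with incoherence. A sketch of the classical matrix version on discretized curves follows; the functional adaptation in the paper is not reproduced here.

```python
import numpy as np

def mean_squared_residue(A):
    """Mean squared residue of a candidate locus.

    A : (n_curves, n_gridpoints) values of the curves on a domain subset.
    Score is 0 for a purely additive (shift) pattern and increases as the
    group of curves deviates from it.
    """
    row = A.mean(axis=1, keepdims=True)   # per-curve mean
    col = A.mean(axis=0, keepdims=True)   # per-gridpoint mean
    return float(((A - row - col + A.mean()) ** 2).mean())
```

In a multi-step pipeline like the one the abstract describes, such a score serves as the validation criterion: candidate loci produced by hierarchical clustering are kept only if their residue stays below a tolerance.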
Title: funLOCI: A Local Clustering Algorithm for Functional Data