Journal of Classification: Latest Articles

Inferential Tools for Assessing Dependence Across Response Categories in Multinomial Models with Discrete Random Effects
IF 2 | Zone 4, Computer Science | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-03-04 | DOI: 10.1007/s00357-024-09466-2

Abstract

We propose a discrete random effects multinomial regression model to deal with estimation and inference issues in the case of categorical and hierarchical data. Random effects are assumed to follow a discrete distribution with an a priori unknown number of support points. For a K-category response, the model identifies a latent structure at the highest level of grouping, where groups are clustered into subpopulations. The model does not assume independence across the random effects associated with different response categories, which improves on the multinomial semi-parametric multilevel model previously proposed in the literature. Since the category-specific random effects arise from the same subjects, the independence assumption seldom holds in real data. To evaluate the improvements provided by the proposed model, we reproduce simulation and case studies from the literature, highlighting the strength of the method in properly modelling the real data structure and the advantages of taking the data dependence structure into account.
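
As a rough illustration of the model class described above, the following sketch computes category probabilities for a multinomial logit in which each latent subpopulation contributes one vector of category-specific random intercepts. The parametrization (category 0 as reference, random intercepts only) and all names and values (`category_probs`, `betas`, `support_points`) are assumptions made for illustration; the abstract does not specify them.

```python
import math

def category_probs(x, betas, b_m):
    """Multinomial-logit probabilities for K categories (category 0 = reference).

    x      : covariate vector
    betas  : K-1 coefficient vectors, one per non-reference category
    b_m    : K-1 random intercepts of subpopulation m (one support point)
    """
    etas = [sum(xi * bi for xi, bi in zip(x, beta)) + b
            for beta, b in zip(betas, b_m)]
    denom = 1.0 + sum(math.exp(e) for e in etas)
    return [1.0 / denom] + [math.exp(e) / denom for e in etas]

# Hypothetical values: two covariates, K = 3 categories, two support points.
x = [1.0, 0.5]
betas = [[0.2, -0.1], [-0.3, 0.4]]
support_points = [[-1.0, 0.5], [1.0, 1.5]]  # each point is a joint (K-1)-vector

for b_m in support_points:
    probs = category_probs(x, betas, b_m)
    print([round(p, 3) for p in probs])
```

Because each support point is a whole vector, the intercepts for different categories move together across subpopulations, which is one way dependence across response categories can be represented.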

Citations: 0
Prediction of Forest Fire Risk for Artillery Military Training using Weighted Support Vector Machine for Imbalanced Data
IF 2 | Zone 4, Computer Science | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-03-04 | DOI: 10.1007/s00357-024-09467-1
Ji Hyun Nam, Jongmin Mun, Seongil Jo, Jaeoh Kim

Since the 1953 truce, the Republic of Korea Army (ROKA) has regularly conducted artillery training, posing a risk of wildfires, a threat to both the environment and the public perception of national defense. To assess this risk and aid decision-making within the ROKA, we built a predictive model of wildfires triggered by artillery training. To this end, we combined the ROKA dataset with a meteorological database. Given the infrequent occurrence of wildfires (imbalance ratio $\approx$ 1:24 in our dataset), achieving balanced detection of wildfire occurrences and non-occurrences is challenging. Our approach combines a weighted support vector machine with Gaussian mixture-based oversampling, effectively penalizing misclassification of wildfires. Applied to our dataset, our method outperforms traditional algorithms (G-mean = 0.864, sensitivity = 0.956, specificity = 0.781), indicating balanced detection. This study not only helps reduce wildfires during artillery training but also provides a practical wildfire prediction method for similar climates worldwide.
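
The reported G-mean can be cross-checked against the reported sensitivity and specificity, assuming the usual definition of G-mean as the geometric mean of the two rates (the abstract does not state its definition). A minimal sketch; the function name `g_mean` is illustrative.

```python
import math

def g_mean(sensitivity: float, specificity: float) -> float:
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    return math.sqrt(sensitivity * specificity)

# Sensitivity and specificity as reported in the abstract.
print(round(g_mean(0.956, 0.781), 3))  # 0.864
```

Under that definition the three reported figures are mutually consistent. A classifier that ignored the minority class would have sensitivity 0 and hence G-mean 0, which is why the metric suits a roughly 1:24 imbalance.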

Citations: 0
Binary Peacock Algorithm: A Novel Metaheuristic Approach for Feature Selection
IF 2 | Zone 4, Computer Science | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-03-04 | DOI: 10.1007/s00357-024-09468-0
Hema Banati, Richa Sharma, Asha Yadav

Binary metaheuristic algorithms prove to be invaluable for solving binary optimization problems. This paper proposes a binary variant of the peacock algorithm (PA) for feature selection. PA, a recent metaheuristic algorithm, is built upon the lekking and mating behaviors of peacocks and peahens. While designing the binary variant, two major shortcomings of PA (lek formation and offspring generation) were identified and addressed. Eight binary variants of PA are also proposed and compared on mean fitness to identify the best variant, called the binary peacock algorithm (bPA). To validate bPA's performance, experiments are conducted on 34 benchmark datasets and the results are compared with eight well-known binary metaheuristic algorithms. The results show that bPA classifies 30 datasets with the highest accuracy and selects the fewest features on 32 datasets, achieving up to a 99.80% reduction in feature subset size on the dataset with the most features. bPA attained rank 1 in the Friedman rank test over all parameters.
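
To make the two quantities in this abstract concrete (feature subset reduction and fitness), here is a small sketch. A candidate solution is a binary mask over features. The fitness shown is a weighted sum of error rate and selected-feature ratio, a form commonly used in wrapper feature selection; whether bPA uses this form, and the weight `alpha`, are assumptions not confirmed by the abstract.

```python
from typing import Sequence

def reduction_percent(mask: Sequence[int]) -> float:
    """Percentage of features dropped by a binary selection mask."""
    selected = sum(mask)
    return 100.0 * (len(mask) - selected) / len(mask)

def fitness(error_rate: float, mask: Sequence[int], alpha: float = 0.99) -> float:
    """Lower is better: alpha * error + (1 - alpha) * fraction of features kept.

    alpha is a hypothetical trade-off weight, not a value from the paper.
    """
    return alpha * error_rate + (1.0 - alpha) * sum(mask) / len(mask)

mask = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]  # keep 2 of 10 features
print(reduction_percent(mask))  # 80.0
print(fitness(0.10, mask))
```

With a weight close to 1, accuracy dominates and the subset-size term mainly breaks ties between masks that classify equally well.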

Citations: 0
Supervised Classification of High-Dimensional Correlated Data: Application to Genomic Data
IF 2 | Zone 4, Computer Science | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-02-28 | DOI: 10.1007/s00357-024-09463-5
Aboubacry Gaye, Abdou Ka Diongue, Seydou Nourou Sylla, Maryam Diarra, Amadou Diallo, Cheikh Talla, Cheikh Loucoubar

This work addresses the problem of supervised classification for high-dimensional and highly correlated data using correlation blocks and supervised dimension reduction. We propose a method that combines block partitioning based on interval graph modeling with an extension of principal component analysis (PCA) that incorporates conditional class moment estimates in the low-dimensional projection. Block partitioning handles the high correlation in the data by grouping variables into blocks such that the correlation within a block is maximized and the correlation between variables in different blocks is minimized. The extended PCA allows us to perform supervised low-dimensional projection and clustering. Applied to gene expression data from 445 individuals divided into two groups (diseased and non-diseased) and 719,656 single nucleotide polymorphisms (SNPs), this method shows good clustering and prediction performance. SNPs are a type of genetic variation that represents a difference in a single deoxyribonucleic acid (DNA) building block, namely a nucleotide. Previous research has shown that SNPs can be used to identify the correct population origin of an individual and can act in isolation or simultaneously to impact a phenotype. In this regard, the study of the contribution of genetics to infectious disease phenotypes is crucial. The classical statistical models currently used in genome-wide association studies (GWAS) have shown their limitations in detecting genes of interest in the study of complex diseases such as asthma or malaria. In this study, we first investigate a linkage disequilibrium (LD) block partition method based on interval graph modeling to handle the high correlation between SNPs. Then, we use supervised approaches, in particular the approach that extends PCA by incorporating conditional class moment estimates in the low-dimensional projection, to identify the determining SNPs in malaria episodes. Experimental results obtained on the Dielmo-Ndiop project dataset show that the linear discriminant analysis (LDA) approach has significantly high accuracy in predicting malaria episodes.

Citations: 0
Soft Label Guided Unsupervised Discriminative Sparse Subspace Feature Selection
IF 2 | Zone 4, Computer Science | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-01-25 | DOI: 10.1007/s00357-024-09462-6
Keding Chen, Yong Peng, Feiping Nie, Wanzeng Kong

Feature selection and subspace learning are two primary methods to achieve data dimensionality reduction and discriminability enhancement. However, data label information is unavailable in unsupervised learning to guide the dimensionality reduction process. To this end, we propose a soft label guided unsupervised discriminative sparse subspace feature selection (UDS$^2$FS) model in this paper, which has two advantages over existing studies. On the one hand, UDS$^2$FS aims to find a discriminative subspace that simultaneously maximizes the between-class data scatter and minimizes the within-class scatter. On the other hand, UDS$^2$FS estimates the data label information in the learned subspace, which in turn serves as soft labels to guide the discriminative subspace learning process. Moreover, the $\ell_{2,0}$-norm is imposed to achieve row sparsity of the subspace projection matrix, which is parameter-free and more stable compared to the $\ell_{2,1}$-norm. Experimental studies to evaluate the performance of UDS$^2$FS are performed from three aspects: a synthetic data set to check its iterative optimization process, several toy data sets to visualize the feature selection effect, and some benchmark data sets to examine the clustering performance of UDS$^2$FS. The results show that UDS$^2$FS exhibits competitive performance in joint subspace learning and feature selection in comparison with some related models.
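
The two row-sparsity measures contrasted above can be computed directly from their standard definitions: for a projection matrix W with one row per feature, the l2,1-norm is the sum of the Euclidean norms of the rows, and the l2,0-"norm" is the number of rows that are not entirely zero. The sketch below uses plain Python lists; the tolerance `tol` and the example matrix are illustrative assumptions.

```python
import math

def row_norms(W):
    """Euclidean norm of each row of W (one row per feature)."""
    return [math.sqrt(sum(v * v for v in row)) for row in W]

def l21_norm(W):
    """Sum of row norms: a convex surrogate that encourages row sparsity."""
    return sum(row_norms(W))

def l20_norm(W, tol=1e-12):
    """Number of nonzero rows, i.e. the number of features actually used."""
    return sum(1 for r in row_norms(W) if r > tol)

W = [[3.0, 4.0],   # feature 0 is used
     [0.0, 0.0],   # feature 1 is discarded (zero row)
     [0.0, 1.0]]   # feature 2 is used
print(l20_norm(W))  # 2
print(l21_norm(W))  # 6.0
```

Constraining the l2,0 value to a target count fixes the number of selected features directly, whereas an l2,1 penalty needs a regularization weight to be tuned, which is one reading of the "parameter-free" remark in the abstract.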

Citations: 0
Variable Selection for Hidden Markov Models with Continuous Variables and Missing Data
IF 2 | Zone 4, Computer Science | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-01-23 | DOI: 10.1007/s00357-023-09457-9
Fulvia Pennoni, Francesco Bartolucci, Silvia Pandolfi

We propose a variable selection method for multivariate hidden Markov models with continuous responses that are partially or completely missing at a given time occasion. Through this procedure, we achieve a dimensionality reduction by selecting the subset of the most informative responses for clustering individuals and simultaneously choosing the optimal number of these clusters corresponding to latent states. The approach is based on comparing different model specifications in terms of the subset of responses assumed to be dependent on the latent states, and it relies on a greedy search algorithm based on the Bayesian information criterion seen as an approximation of the Bayes factor. A suitable expectation-maximization algorithm is employed to obtain maximum likelihood estimates of the model parameters under the missing-at-random assumption. The proposal is illustrated via Monte Carlo simulation and an application where development indicators collected over eighteen years are selected, and countries are clustered into groups to evaluate their growth over time.
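
The general shape of a BIC-driven greedy search can be sketched as follows. This is a simplified forward-only search over a generic scoring function, not the authors' procedure: fitting the hidden Markov model, handling missing data, and choosing the number of latent states are all hidden behind the hypothetical `score` callable, and the toy scores are invented for illustration.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """BIC in the 'smaller is better' convention: -2*logL + k*log(n)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

def greedy_forward(candidates, score):
    """Repeatedly add the candidate that lowers the score most; stop when none does."""
    selected, best = [], score([])
    improved = True
    while improved:
        improved = False
        for c in candidates:
            if c in selected:
                continue
            s = score(selected + [c])
            if s < best:  # best tracks the running minimum, so pick ends as the argmin
                best, pick, improved = s, c, True
        if improved:
            selected.append(pick)
    return selected, best

# Hypothetical BIC values for each subset of two responses.
toy_scores = {(): 100.0, ("y1",): 90.0, ("y2",): 95.0, ("y1", "y2"): 92.0}
score = lambda subset: toy_scores[tuple(sorted(subset))]
print(greedy_forward(["y1", "y2"], score))  # (['y1'], 90.0)
```

In practice `score` would fit a model with the given responses treated as dependent on the latent states and return its BIC, so each call is expensive and the greedy strategy avoids enumerating all subsets.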

Citations: 0
Nonparametric Cognitive Diagnosis When Attributes Are Polytomous
IF 2 | Zone 4, Computer Science | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-01-11 | DOI: 10.1007/s00357-023-09461-z
Youn Seon Lim

Cognitive diagnosis models provide diagnostic information on whether examinees have mastered the skills, called “attributes,” that characterize a given knowledge domain. Based on attribute mastery, distinct proficiency classes are defined to which examinees are assigned based on their item responses. Attributes are typically perceived as binary. However, polytomous attributes may yield higher precision in the assessment of examinees’ attribute mastery. Karelitz (2004) introduced the ordered-category attribute coding framework (OCAC) to accommodate polytomous attributes. Other approaches to handle polytomous attributes in cognitive diagnosis have been proposed in the literature. However, the heavy parameterization of these models often created difficulties in fitting these models. In this article, a nonparametric method for cognitive diagnosis is proposed for use with polytomous attributes, called the nonparametric polytomous attributes diagnostic classification (NPADC) method, that relies on an adaptation of the OCAC framework. The new NPADC method proposed here can be used with various cognitive diagnosis models. It does not require large sample sizes; it is computationally efficient and highly effective as is evidenced by the recovery rates of the proficiency classes observed in large-scale simulation studies. The NPADC method is also used with a real-world data set.
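
One common way to classify examinees without fitting a parametric model is to compare the observed item responses with the ideal responses of every candidate attribute profile and pick the closest profile. The sketch below does this for binary attributes with a conjunctive ideal-response rule and plain Hamming distance. These choices, the Q-matrix, and the function names are assumptions for illustration; the abstract does not describe the distance or the polytomous adaptation that NPADC actually uses.

```python
from itertools import product

def ideal_response(alpha, q_row):
    """Conjunctive rule: correct (1) only if every required attribute is mastered."""
    return int(all(a >= q for a, q in zip(alpha, q_row)))

def classify(responses, Q):
    """Return the attribute profile whose ideal responses are closest in Hamming distance."""
    n_attr = len(Q[0])
    best_profile, best_dist = None, None
    for alpha in product([0, 1], repeat=n_attr):
        ideal = [ideal_response(alpha, q_row) for q_row in Q]
        dist = sum(r != e for r, e in zip(responses, ideal))
        if best_dist is None or dist < best_dist:
            best_profile, best_dist = alpha, dist
    return best_profile, best_dist

# Hypothetical Q-matrix: 3 items, 2 attributes (1 = item requires the attribute).
Q = [[1, 0],
     [0, 1],
     [1, 1]]
print(classify([1, 0, 0], Q))  # ((1, 0), 0)
```

No item parameters are estimated here, which is consistent with the abstract's point that a nonparametric approach does not need large samples. Ties are broken in favor of the first profile enumerated.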

Citations: 0
Parsimonious Seemingly Unrelated Contaminated Normal Cluster-Weighted Models
IF 2 | Zone 4, Computer Science | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-01-08 | DOI: 10.1007/s00357-023-09458-8

Abstract

Normal cluster-weighted models constitute a modern approach to linear regression that simultaneously performs model-based cluster analysis and multivariate linear regression analysis with random quantitative regressors. Robustified models have been recently developed, based on the use of the contaminated normal distribution, which can manage the presence of mildly atypical observations. A more flexible class of contaminated normal linear cluster-weighted models is specified here, in which the researcher is free to use a different vector of regressors for each response. The novel class also includes parsimonious models, where parsimony is attained by imposing suitable constraints on the component-covariance matrices of either the responses or the regressors. Identifiability conditions are illustrated and discussed. An expectation-conditional maximisation algorithm is provided for the maximum likelihood estimation of the model parameters. The effectiveness and usefulness of the proposed models are shown through the analysis of simulated and real datasets.
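
The contaminated normal distribution mentioned above is a two-component mixture with a shared mean, in which a small share of observations comes from a component with inflated variance. The univariate sketch below assumes the parametrization in which `alpha` is the proportion of typical observations and `eta` > 1 is the variance inflation factor; the multivariate, cluster-weighted version in the paper is not reproduced here, and the function names are illustrative.

```python
import math

def normal_pdf(x, mu, var):
    """Density of a univariate normal with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def contaminated_normal_pdf(x, mu, var, alpha, eta):
    """alpha * N(mu, var) + (1 - alpha) * N(mu, eta * var), with eta > 1."""
    if not 0.0 < alpha < 1.0 or eta <= 1.0:
        raise ValueError("need 0 < alpha < 1 and eta > 1")
    return (alpha * normal_pdf(x, mu, var)
            + (1.0 - alpha) * normal_pdf(x, mu, eta * var))

def prob_typical(x, mu, var, alpha, eta):
    """Posterior probability that x comes from the uninflated (typical) component."""
    good = alpha * normal_pdf(x, mu, var)
    return good / contaminated_normal_pdf(x, mu, var, alpha, eta)

# Hypothetical parameters: 90% typical points, variance inflated tenfold for the rest.
for x in (0.0, 2.0, 5.0):
    print(x, round(prob_typical(x, 0.0, 1.0, 0.9, 10.0), 4))
```

The posterior probability drops as a point moves away from the mean, so it can serve both to downweight mildly atypical observations during estimation and to flag them afterwards.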

Citations: 0
Unsupervised Classification with a Family of Parsimonious Contaminated Shifted Asymmetric Laplace Mixtures
IF 2 Zone 4 Computer Science Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-01-06 DOI: 10.1007/s00357-023-09460-0

Abstract

A family of parsimonious contaminated shifted asymmetric Laplace mixtures is developed for unsupervised classification of asymmetric clusters in the presence of outliers and noise. A series of constraints is applied to a modified factor analyzer structure of the component scale matrices, yielding a family of twelve models. The modified factor analyzer structure and these parsimonious constraints make the models effective for the analysis of high-dimensional data by reducing the number of free parameters that need to be estimated. A variant of the expectation-maximization algorithm is developed for parameter estimation, with convergence issues discussed and addressed. Popular model selection criteria such as the Bayesian information criterion and the integrated complete likelihood (ICL) are utilized, and a novel modification to the ICL is also considered. Through a series of simulation studies and real data analyses, which include comparisons to well-established methods, we demonstrate the improvements in classification performance found using the proposed family of models.

Citations: 0
funLOCI: A Local Clustering Algorithm for Functional Data
IF 2 Zone 4 Computer Science Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS Pub Date: 2023-12-07 DOI: 10.1007/s00357-023-09456-w
Jacopo Di Iorio, Simone Vantini

Nowadays, an increasing number of problems involve data with one infinite continuous dimension, known as functional data. In this paper, we introduce the funLOCI algorithm, which enables the identification of functional local clusters or functional loci, i.e., subsets or groups of curves that exhibit similar behavior across the same continuous subset of the domain. The definition of functional local clusters incorporates ideas from multivariate and functional clustering and biclustering and is based on an additive model that takes into account the shape of the curves. funLOCI is a multi-step algorithm that relies on hierarchical clustering and a functional version of the mean squared residue score to identify and validate candidate loci. Subsequently, all the results are collected and ordered in a post-processing step. To evaluate our algorithm's performance, we conduct extensive simulations and compare it with other recently proposed algorithms in the literature. Furthermore, we apply funLOCI to a real-data case regarding inner carotid arteries.

Citations: 1