
Journal of Classification: Latest Publications

A New Look at the Dirichlet Distribution: Robustness, Clustering, and Both Together
IF 2 | CAS Zone 4 (Computer Science) | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-07-02 | DOI: 10.1007/s00357-024-09480-4
Salvatore D. Tomarchio, Antonio Punzo, Johannes T. Ferreira, Andriette Bekker

Compositional data have peculiar characteristics that pose significant challenges to traditional statistical methods and models. Within this framework, we use a convenient mode-parametrized Dirichlet distribution across multiple fields of statistics. First, we propose finite mixtures of unimodal Dirichlet (UD) distributions for model-based clustering and classification. Second, we introduce the contaminated UD (CUD) distribution, a heavy-tailed generalization of the UD distribution that allows for more flexible tail behavior in the presence of atypical observations. Third, we propose finite mixtures of CUD distributions to jointly account for the presence of clusters and atypical points in the data. Parameter estimation is carried out either by directly maximizing the likelihood or by using an expectation-maximization (EM) algorithm. Two analyses are conducted on simulated data to illustrate the effects of atypical observations on parameter estimation and data classification, and how our proposals address both aspects. Furthermore, two real datasets are investigated and the results obtained via our models are discussed.
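As a concrete reference point, the sketch below fits a plain K-component Dirichlet mixture by EM in Python. It is only a baseline illustration of the mixture-fitting machinery, not the authors' mode-parametrized UD/CUD implementation; the numerical M-step via `scipy.optimize.minimize` is an assumption of convenience, since the Dirichlet M-step has no closed form.

```python
# Minimal EM for a K-component Dirichlet mixture (baseline sketch only;
# the paper's UD/CUD mixtures use a mode parametrization and a
# contamination component not reproduced here).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import dirichlet

def em_dirichlet_mixture(X, K, n_iter=50, seed=0):
    # X: (n, d) compositional data, rows strictly positive and summing to 1
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alphas = rng.uniform(1.0, 5.0, size=(K, d))  # concentration parameters
    pis = np.full(K, 1.0 / K)                    # mixing weights
    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] proportional to pi_k * f_k(x_i)
        logp = np.column_stack(
            [np.log(pis[k]) + dirichlet.logpdf(X.T, alphas[k]) for k in range(K)]
        )
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: closed form for the weights; numerical update for each
        # alpha_k, optimized on the log scale to keep parameters positive
        pis = r.mean(axis=0)
        for k in range(K):
            nll = lambda la, k=k: -np.sum(r[:, k] * dirichlet.logpdf(X.T, np.exp(la)))
            alphas[k] = np.exp(minimize(nll, np.log(alphas[k]), method="Nelder-Mead").x)
    return pis, alphas, r
```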

Citations: 0
Automatic Topic Title Assignment with Word Embedding
IF 2 | CAS Zone 4 (Computer Science) | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-07-01 | DOI: 10.1007/s00357-024-09476-0
Gianpaolo Zammarchi, Maurizio Romano, Claudio Conversano

In this paper, we propose TAWE (title assignment with word embedding), a new method to automatically assign titles to topics inferred from sets of documents. This method combines the results obtained from topic modeling performed with, e.g., latent Dirichlet allocation (LDA) or other suitable methods and the word embedding representation of words in a vector space. This representation preserves the meaning of the words while allowing us to find the most suitable word to represent the topic. The procedure is twofold: first, a cleaned text is used to build the LDA model and infer a desirable number of latent topics; second, a reasonable number of words and their weights are extracted from each topic and represented in n-dimensional space using word embedding. Based on the selected weighted words, a centroid is computed, and the closest word is chosen as the title of the topic. To test the method, we used a collection of tweets about climate change downloaded from the Twitter accounts of some major newspapers. The results showed that TAWE is a suitable method for automatically assigning topic titles.
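The title-picking step is easy to make concrete. In the sketch below, `embeddings` is an assumed dict-like word-to-vector lookup (e.g., loaded from pretrained word2vec or GloVe), and cosine similarity is used as the notion of "closest"; treat both choices as illustrative rather than the authors' exact implementation.

```python
# TAWE's centroid step in miniature: weighted centroid of a topic's top
# words in embedding space; the candidate word nearest the centroid
# becomes the title. Assumes at least one top word has an embedding.
import numpy as np

def tawe_title(top_words, weights, embeddings):
    pairs = [(w, wt) for w, wt in zip(top_words, weights) if w in embeddings]
    words = [w for w, _ in pairs]
    W = np.array([embeddings[w] for w in words])
    a = np.array([wt for _, wt in pairs], dtype=float)
    centroid = (a[:, None] * W).sum(axis=0) / a.sum()   # weighted centroid
    # cosine similarity of each candidate word to the centroid
    sims = W @ centroid / (np.linalg.norm(W, axis=1) * np.linalg.norm(centroid))
    return words[int(np.argmax(sims))]                  # closest word wins
```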

Citations: 0
Normalised Clustering Accuracy: An Asymmetric External Cluster Validity Measure
IF 2 | CAS Zone 4 (Computer Science) | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-06-28 | DOI: 10.1007/s00357-024-09482-2
Marek Gagolewski

There is no, nor will there ever be, a single best clustering algorithm. Nevertheless, we would still like to be able to distinguish between methods that work well on certain task types and those that systematically underperform. Clustering algorithms are traditionally evaluated using either internal or external validity measures. Internal measures quantify different aspects of the obtained partitions, e.g., the average degree of cluster compactness or point separability. However, their validity is questionable because the clusterings they endorse can sometimes be meaningless. External measures, on the other hand, compare the algorithms’ outputs to fixed ground-truth groupings provided by experts. In this paper, we argue that the commonly used classical partition similarity scores, such as the normalised mutual information, Fowlkes–Mallows, or adjusted Rand index, miss some desirable properties. In particular, they do not identify worst-case scenarios correctly, nor are they easily interpretable. As a consequence, the evaluation of clustering algorithms on diverse benchmark datasets can be difficult. To remedy these issues, we propose and analyse a new measure: a version of the optimal set-matching accuracy, which is normalised, monotonic with respect to some similarity relation, scale-invariant, and corrected for the imbalancedness of cluster sizes (but neither symmetric nor adjusted for chance).
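For contrast, the classical optimal set-matching accuracy that the proposed measure builds on can be computed with the Hungarian algorithm. The sketch below shows that raw matching ingredient only; it is not the paper's normalised, imbalance-corrected measure.

```python
# Classical optimal set-matching accuracy via linear_sum_assignment --
# the raw ingredient that the paper normalises and corrects for
# cluster-size imbalance (this is not the proposed measure itself).
import numpy as np
from scipy.optimize import linear_sum_assignment

def matching_accuracy(y_true, y_pred):
    t, p = np.unique(y_true), np.unique(y_pred)
    # confusion matrix between true groups and predicted clusters
    C = np.array([[np.sum((y_true == a) & (y_pred == b)) for b in p] for a in t])
    rows, cols = linear_sum_assignment(-C)   # maximise total matched count
    return C[rows, cols].sum() / len(y_true)
```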

Citations: 0
Sensitivity and Specificity versus Precision and Recall, and Related Dilemmas
IF 2 | CAS Zone 4 (Computer Science) | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-06-26 | DOI: 10.1007/s00357-024-09478-y
William Cullerne Bown

Many evaluations of binary classifiers begin by adopting a pair of indicators, most often sensitivity and specificity or precision and recall. Despite this, we lack a general, pan-disciplinary basis for choosing one pair over the other, or over one of four other sibling pairs. Related obscurity afflicts the choice between the receiver operating characteristic and the precision-recall curve. Here, I return to first principles to separate concerns and distinguish more than 50 foundational concepts. This allows me to establish six rules that allow one to identify which pair is correct. The choice depends on the context in which the classifier is to operate, the intended use of the classifications, their intended user(s), and the measurability of the underlying classes, but not skew. The rules can be applied by those who develop, operate, or regulate them to classifiers composed of technology, people, or combinations of the two.
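For reference, both indicator pairs discussed here come from the same binary confusion matrix; a minimal sketch of the standard definitions:

```python
# Both indicator pairs from one binary confusion matrix (standard
# definitions). Precision depends on class prevalence, while
# sensitivity and specificity do not.
def indicator_pairs(tp, fp, fn, tn):
    sensitivity = tp / (tp + fn)   # a.k.a. recall, true-positive rate
    specificity = tn / (tn + fp)   # true-negative rate
    precision = tp / (tp + fp)     # positive predictive value
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "recall": sensitivity}
```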

Citations: 0
Clustering Longitudinal Data for Growth Curve Modelling by Gibbs Sampler and Information Criterion
IF 2 | CAS Zone 4 (Computer Science) | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-06-19 | DOI: 10.1007/s00357-024-09477-z
Yu Fei, Rongli Li, Zhouhong Li, Guoqi Qian

Clustering longitudinal data for growth curve modelling is considered in this paper, where we aim to optimally estimate the underlying unknown group partition matrix. Instead of following the conventional soft clustering approach, which assumes the columns of the partition matrix to have i.i.d. multinomial or categorical prior distributions and uses a regression model with the response following a finite mixture distribution to estimate the posterior distribution of the partition matrix, we propose an iterative partition and regression procedure to find the best partition matrix and the associated best growth curve regression model for each identified cluster. We show that the best partition matrix is the one minimizing a recently developed empirical Bayes information criterion (eBIC), which, due to the combinatorial explosion involved, is difficult to compute by enumerating all candidate partition matrices. Thus, we develop a Gibbs sampling method to generate a Markov chain of candidate partition matrices whose equilibrium probability distribution equals the one induced by eBIC. We further show that the best partition matrix, given a priori the number of latent clusters, can be consistently estimated and is computationally scalable based on this Markov chain. The number of latent clusters is also best estimated by minimizing eBIC. The proposed iterative clustering and regression method is assessed by a comprehensive simulation study before being applied to two real-world growth curve modelling examples involving longitudinal data clustering.
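A schematic single Gibbs sweep over cluster labels is sketched below. Here `criterion(labels)` is an assumed stand-in for the eBIC of the fitted growth-curve model, and the stationary law proportional to exp(-criterion/2) is an illustrative choice rather than the paper's exact construction.

```python
# One Gibbs sweep over unit labels with stationary distribution
# proportional to exp(-criterion(labels)/2); `criterion` is an assumed
# stand-in for the eBIC of the growth-curve fit.
import numpy as np

def gibbs_sweep(labels, K, criterion, rng):
    labels = labels.copy()
    for i in range(labels.size):
        # full conditional for unit i's label, given all other labels
        scores = np.empty(K)
        for k in range(K):
            labels[i] = k
            scores[k] = -0.5 * criterion(labels)
        p = np.exp(scores - scores.max())    # stabilised softmax
        labels[i] = rng.choice(K, p=p / p.sum())
    return labels
```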

Citations: 0
Finding Outliers in Gaussian Model-based Clustering
IF 2 | CAS Zone 4 (Computer Science) | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-05-30 | DOI: 10.1007/s00357-024-09473-3
Katharine M. Clark, Paul D. McNicholas

Clustering, or unsupervised classification, is a task often plagued by outliers. Yet there is a paucity of work on handling outliers in clustering. Outlier identification algorithms tend to fall into three broad categories: outlier inclusion, outlier trimming, and post hoc outlier identification methods, with the former two often requiring pre-specification of the number of outliers. The fact that sample squared Mahalanobis distance is beta-distributed is used to derive an approximate distribution for the log-likelihoods of subset finite Gaussian mixture models. An algorithm is then proposed that removes the least plausible points according to the subset log-likelihoods, which are deemed outliers, until the subset log-likelihoods adhere to the reference distribution. This results in a trimming method, called OCLUST, that inherently estimates the number of outliers.
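A bare-bones sketch of the trimming loop using the beta fact directly appears below. For brevity a single Gaussian cluster is assumed, and a Kolmogorov-Smirnov test is used as a simple stand-in for OCLUST's subset log-likelihood criterion.

```python
# Trim the least plausible points until the scaled squared Mahalanobis
# distances n*D^2/(n-1)^2 look Beta(d/2, (n-d-1)/2), as they should for
# a single Gaussian sample with estimated mean and covariance
# (KS test used here as a stand-in acceptance criterion).
import numpy as np
from scipy.stats import beta, kstest

def trim_outliers(X, alpha=0.01):
    X = np.asarray(X, dtype=float)
    while True:
        n, d = X.shape
        if n <= d + 2:                       # guard: too few points to test
            return X
        diff = X - X.mean(axis=0)
        Sinv = np.linalg.inv(np.cov(X, rowvar=False))
        D2 = np.einsum("ij,jk,ik->i", diff, Sinv, diff)
        u = n * D2 / (n - 1) ** 2            # beta-distributed under normality
        if kstest(u, beta(d / 2, (n - d - 1) / 2).cdf).pvalue > alpha:
            return X
        X = np.delete(X, np.argmax(D2), axis=0)  # drop least plausible point
```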

Citations: 0
A Novel Classification Algorithm Based on the Synergy Between Dynamic Clustering with Adaptive Distances and K-Nearest Neighbors
IF 2 | CAS Zone 4 (Computer Science) | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-05-11 | DOI: 10.1007/s00357-024-09471-5
Mohammed Sabri, Rosanna Verde, Antonio Balzanella, Fabrizio Maturo, Hamid Tairi, Ali Yahyaouy, Jamal Riffi

This paper introduces a novel supervised classification method based on dynamic clustering (DC) and K-nearest neighbor (KNN) learning algorithms, denoted DC-KNN. The aim is to improve the accuracy of a classifier by using a DC method to discover the hidden patterns of the a priori groups of the training set. It provides a partitioning of each group into a predetermined number of subgroups. A new objective function is designed for the DC variant, based on a trade-off between the compactness and separation of all subgroups in the original groups. Moreover, the proposed DC method uses adaptive distances, which assign a set of weights to the variables of each cluster depending on both their intra-cluster and inter-cluster structure. DC-KNN minimizes a suitable objective function. The KNN algorithm then classifies objects by assigning them subgroup labels. Furthermore, the classification step is performed according to two competing KNN algorithms. The proposed strategies have been evaluated using both synthetic data and widely used real datasets from public repositories. The achieved results have confirmed the effectiveness and robustness of the strategy in improving classification accuracy in comparison to alternative approaches.
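A bare-bones version of the DC-KNN pipeline is sketched below: each a priori class is split into subgroups, KNN is trained on the subgroup labels, and predictions are mapped back to the parent classes. Plain KMeans stands in for the paper's dynamic clustering with adaptive distances, so treat this as the pipeline shape only.

```python
# Minimal DC-KNN-style pipeline: per-class subgrouping (plain KMeans as
# a stand-in for the paper's adaptive-distance dynamic clustering),
# KNN on subgroup labels, then mapping back to parent classes.
# Assumes every class has at least `n_sub` training points.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def dc_knn_fit_predict(X_train, y_train, X_test, n_sub=2, k=5):
    sub_labels = np.empty(len(y_train), dtype=int)
    parent, next_id = {}, 0
    for c in np.unique(y_train):
        idx = np.where(y_train == c)[0]
        km = KMeans(n_clusters=n_sub, n_init=10, random_state=0).fit(X_train[idx])
        sub_labels[idx] = km.labels_ + next_id   # globally unique subgroup ids
        for s in range(n_sub):
            parent[next_id + s] = c
        next_id += n_sub
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, sub_labels)
    return np.array([parent[s] for s in knn.predict(X_test)])
```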

Citations: 0
Accelerated Sequential Data Clustering
IF 2 | CAS Zone 4 (Computer Science) | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-05-09 | DOI: 10.1007/s00357-024-09472-4
Reza Mortazavi, Elham Enayati, Abdolali Basiri

Data clustering is an important task in the field of data mining. In many real applications, clustering algorithms must consider the order of the data, resulting in the problem of clustering sequential data. For instance, analyzing the moving pattern of an object and detecting community structure in a complex network are related to sequential data clustering. The constraint that clusters form contiguous regions prevents conventional clustering algorithms from being applied directly to the problem. A dynamic programming algorithm was proposed to address the issue, which returns the optimal sequential data clustering. However, it does not scale well, which limits its practicality. This paper revisits the solution and enhances it by introducing a greedy stopping condition. This condition halts the algorithm’s search process when it is likely that the optimal solution has been found. Experimental results on multiple datasets show that the algorithm is much faster than the original solution while the optimality gap is negligible.
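The exact dynamic program being accelerated is, in essence, the classic O(kn^2) segmentation of a sequence into k contiguous clusters. A self-contained sketch (minimum within-segment sum of squares, with prefix sums giving O(1) segment costs) follows; the greedy stopping condition that this paper contributes is not reproduced here.

```python
# Textbook dynamic program: optimal partition of a 1-D sequence into k
# contiguous segments minimising within-segment sum of squares.
import numpy as np

def segment_cost(prefix, prefix2, i, j):
    # SSE of x[i:j] (j exclusive) from prefix sums of x and x^2
    s, s2, m = prefix[j] - prefix[i], prefix2[j] - prefix2[i], j - i
    return s2 - s * s / m

def optimal_sequential_clustering(x, k):
    n = len(x)
    prefix = np.concatenate([[0.0], np.cumsum(x)])
    prefix2 = np.concatenate([[0.0], np.cumsum(np.square(x))])
    cost = np.full((k + 1, n + 1), np.inf)
    cut = np.zeros((k + 1, n + 1), dtype=int)
    cost[0, 0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):        # last segment is x[i:j]
                val = cost[c - 1, i] + segment_cost(prefix, prefix2, i, j)
                if val < cost[c, j]:
                    cost[c, j], cut[c, j] = val, i
    bounds, j = [], n                        # backtrack segment boundaries
    for c in range(k, 0, -1):
        bounds.append((cut[c, j], j))
        j = cut[c, j]
    return cost[k, n], bounds[::-1]
```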

Citations: 0
Skew Multiple Scaled Mixtures of Normal Distributions with Flexible Tail Behavior and Their Application to Clustering
IF 2 | CAS Zone 4 (Computer Science) | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-05-06 | DOI: 10.1007/s00357-024-09470-6
Abbas Mahdavi, Anthony F. Desmond, Ahad Jamalizadeh, Tsung-I Lin

The family of multiple scaled mixtures of multivariate normal (MSMN) distributions has been shown to be a powerful tool for modeling data that allow different marginal amounts of tail weight. An extension of the MSMN distribution is proposed through the incorporation of a vector of shape parameters, resulting in the skew multiple scaled mixtures of multivariate normal (SMSMN) distributions. The family of SMSMN distributions can express a variety of shapes by controlling different degrees of tailedness and versatile skewness in each dimension. Some characterizations and probabilistic properties of the SMSMN distributions are studied and an extension to finite mixtures thereof is also discussed. Based on a sort of selection mechanism, a feasible ECME algorithm is designed to compute the maximum likelihood estimates of model parameters. Numerical experiments on simulated data and three real data examples demonstrate the efficacy and usefulness of the proposed methodology.
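The "multiple scaled" idea itself is easy to see in a toy simulation: give each coordinate its own scale-mixing variable, so tail weight differs by dimension. The sketch below uses per-axis Student-t-style mixing as baseline intuition only; it omits the paper's skewness mechanism and is not the authors' exact construction.

```python
# Toy illustration: dimension-wise scale mixing yields a different tail
# weight in each coordinate (t-like with nu=2 in dim 1, near-Gaussian
# with nu=30 in dim 2). Baseline intuition only, not the SMSMN family.
import numpy as np

rng = np.random.default_rng(1)
n, nu = 5000, np.array([2.0, 30.0])
w = rng.gamma(nu / 2, 2 / nu, size=(n, 2))   # per-dimension mixing weights
X = rng.standard_normal((n, 2)) / np.sqrt(w)
print(np.abs(X).max(axis=0))                 # dim 1 shows far larger extremes
```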

Citations: 0
Multinomial Restricted Unfolding
IF 2 | CAS Zone 4 (Computer Science) | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-04-08 | DOI: 10.1007/s00357-024-09465-3
Mark de Rooij, Frank Busing

For supervised classification we propose to use restricted multidimensional unfolding in a multinomial logistic framework. Where previous research proposed similar models based on squared distances, we propose to use usual (i.e., not squared) Euclidean distances. This change in functional form results in several interpretational advantages of the resulting biplot, a graphical representation of the classification model. First, the conditional probability of any class peaks at the location of the class in the Euclidean space. Second, the interpretation of the biplot is in terms of distances towards the class points, whereas in the squared distance model the interpretation is in terms of the distance towards the decision boundary. Third, the distance between two class points represents an upper bound for the estimated log-odds of choosing one of these classes over the other. For our multinomial restricted unfolding, we develop and test a Majorization Minimization algorithm that monotonically decreases the negative log-likelihood. With two empirical applications we point out the advantages of the distance model and show how to apply multinomial restricted unfolding in practice, including model selection.
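A minimal numeric illustration of the first interpretational property: with plain (unsquared) Euclidean distances, the conditional class probability peaks exactly at the class point. The softmax-of-negative-distance form below is an illustrative reading of the distance model; estimation in the paper proceeds by the MM algorithm on the full likelihood, not by this snippet.

```python
# Class probabilities from plain Euclidean distances between a point's
# low-dimensional representation u and the class points Z: the
# probability of class k is largest when u sits at z_k.
import numpy as np

def class_probs(u, Z):
    d = np.linalg.norm(Z - u, axis=1)   # usual (not squared) distances
    p = np.exp(-d)
    return p / p.sum()

Z = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])   # three class points
print(class_probs(np.array([0.1, 0.0]), Z))          # class 0 dominates
```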

Citations: 0