
Journal of Classification — Latest Publications

How to Measure the Researcher Impact with the Aid of its Impactable Area: A Concrete Approach Using Distance Geometry
IF 2.0 · CAS Q4 (Computer Science) · JCR Q2 (Mathematics, Interdisciplinary Applications) · Pub Date: 2024-08-26 · DOI: 10.1007/s00357-024-09490-2
Beniamino Cappelletti-Montano, Gianmarco Cherchi, Benedetto Manca, Stefano Montaldo, Monica Musio

Assuming that the subject of each scientific publication can be identified by one or more classification entities, we address the problem of determining a similarity function (distance) between classification entities based on how often two classification entities are used in the same publication. This similarity function is then used to obtain a representation of the classification entities as points in a Euclidean space of a suitable dimension by means of optimization and dimensionality reduction algorithms. This procedure also allows us to represent the researchers as points in the same Euclidean space and to determine the distance between researchers according to their scientific production. As a case study, we consider as classification entities the codes of the American Mathematical Society Classification System.
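The procedure described can be sketched end to end: count how often pairs of classification codes co-occur in publications, turn the counts into a dissimilarity, and embed it with classical multidimensional scaling. The cosine-style normalization, the classical MDS step, and the toy codes below are illustrative assumptions; the paper uses its own optimization and dimensionality-reduction algorithms.

```python
import numpy as np

def cooccurrence_distance(pubs, codes):
    """Dissimilarity between classification codes from their co-occurrence.

    pubs: list of sets of codes (one set per publication); codes: ordered list.
    Uses 1 - cosine-normalized co-occurrence counts as a simple dissimilarity
    (an assumption, not the paper's exact similarity function).
    """
    idx = {c: i for i, c in enumerate(codes)}
    n = len(codes)
    C = np.zeros((n, n))
    for p in pubs:
        for a in p:
            for b in p:
                C[idx[a], idx[b]] += 1
    norm = np.sqrt(np.outer(np.diag(C), np.diag(C)))
    sim = np.divide(C, norm, out=np.zeros_like(C), where=norm > 0)
    return 1.0 - sim

def classical_mds(D, dim=2):
    """Embed a distance matrix into a Euclidean space (classical MDS)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J          # double-centred squared distances
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:dim]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))

codes = ["05C", "53A", "62H"]                       # toy MSC-like codes
pubs = [{"05C", "53A"}, {"05C", "53A"}, {"62H"}, {"05C"}]
D = cooccurrence_distance(pubs, codes)
X = classical_mds(D, dim=2)
```

Researchers could then be placed in the same space, e.g., as averages of the embedded points of the codes they publish under.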

Citations: 0
Multi-task Support Vector Machine Classifier with Generalized Huber Loss
IF 2.0 · CAS Q4 (Computer Science) · JCR Q2 (Mathematics, Interdisciplinary Applications) · Pub Date: 2024-08-23 · DOI: 10.1007/s00357-024-09488-w
Qi Liu, Wenxin Zhu, Zhengming Dai, Zhihong Ma

Compared to single-task learning (STL), multi-task learning (MTL) achieves better generalization by exploiting domain-specific information implicit in the training signals of several related tasks. The adaptation of MTL to support vector machines (SVMs) is a rather successful example. Inspired by the recently published generalized Huber loss SVM (GHSVM) and regularized multi-task learning (RMTL), we propose a novel generalized Huber loss multi-task support vector machine for binary classification, covering both linear and non-linear cases, named MTL-GHSVM. The new method extends the GHSVM from single-task to multi-task learning, and the application of the Huber loss to MTL-SVM is, to the best of our knowledge, novel. The proposed method has two main advantages: on the one hand, compared with hinge-loss SVMs and GHSVM, our MTL-GHSVM using the differentiable generalized Huber loss has better generalization performance; on the other hand, it adopts functional iteration to find the optimal solution and does not need to solve a quadratic programming problem (QPP), which can significantly reduce the computational cost. Numerical experiments have been conducted on fifteen real datasets, and the results demonstrate the effectiveness of the proposed multi-task classification algorithm compared with state-of-the-art algorithms.
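As an illustration of why a differentiable loss allows functional iteration instead of a QP, here is a minimal single-task linear classifier trained by gradient descent on a Huber-smoothed hinge loss. The smoothing below is one common choice and the `fit_linear_svm` helper is hypothetical; the paper's generalized Huber loss and its multi-task coupling are not reproduced.

```python
import numpy as np

def huberized_hinge(margin, delta=1.0):
    """Differentiable Huber-style smoothing of the hinge loss: quadratic near
    the hinge point, linear for strongly violated margins. Returns per-sample
    loss and its derivative w.r.t. the margin."""
    loss = np.where(
        margin >= 1.0, 0.0,
        np.where(margin >= 1.0 - delta,
                 (1.0 - margin) ** 2 / (2.0 * delta),
                 1.0 - margin - delta / 2.0))
    grad = np.where(
        margin >= 1.0, 0.0,
        np.where(margin >= 1.0 - delta, -(1.0 - margin) / delta, -1.0))
    return loss, grad

def fit_linear_svm(X, y, delta=1.0, lam=1e-2, lr=0.1, iters=500):
    """Gradient descent on the smoothed-hinge objective; no QP is solved,
    mirroring the computational argument in the abstract."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(iters):
        margin = y * (X @ w + b)
        _, g = huberized_hinge(margin, delta)
        w -= lr * (X.T @ (g * y) / len(y) + lam * w)
        b -= lr * np.mean(g * y)
    return w, b

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
w, b = fit_linear_svm(X, y)
acc = float(np.mean(np.sign(X @ w + b) == y))
```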

Citations: 0
Clustering-Based Oversampling Algorithm for Multi-class Imbalance Learning
IF 2.0 · CAS Q4 (Computer Science) · JCR Q2 (Mathematics, Interdisciplinary Applications) · Pub Date: 2024-08-22 · DOI: 10.1007/s00357-024-09491-1
Haixia Zhao, Jian Wu

Multi-class imbalanced data learning faces many challenges. Its complex structural characteristics cause severe intra-class imbalance or overgeneralization in most solution strategies, which negatively affects learning. This paper proposes a clustering-based oversampling algorithm (COM) to handle multi-class imbalance learning. To avoid losing important information, COM clusters the minority class based on the structural characteristics of the instances, carefully portraying rare instances and outliers by assigning a sampling weight to each cluster. Clusters with high densities are given low weights, and oversampling is then performed within clusters to avoid overgeneralization. COM avoids intra-class imbalance effectively because low-density clusters are more likely than high-density ones to be selected to synthesize instances. Our study used the UCI and KEEL imbalanced datasets to demonstrate the effectiveness and stability of the proposed method.
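The mechanism can be sketched in a few lines: cluster the minority class, weight clusters inversely to a density proxy, and interpolate new points strictly within a cluster so synthetic instances do not bleed across cluster boundaries. Cluster size as a density proxy and the SMOTE-style interpolation are assumptions of this sketch, not the paper's exact scheme.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's k-means; stands in for the paper's clustering step."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def com_style_oversample(X_min, k, n_new, seed=0):
    """Cluster the minority class, give sparse (low-density) clusters higher
    sampling weight, and interpolate new points within a chosen cluster."""
    rng = np.random.default_rng(seed)
    labels = kmeans(X_min, k, seed=seed)
    sizes = np.bincount(labels, minlength=k).astype(float)
    # smaller clusters get larger weight; empty clusters get weight zero
    weights = np.where(sizes > 0, 1.0 / np.maximum(sizes, 1.0), 0.0)
    weights /= weights.sum()
    new_points = []
    for _ in range(n_new):
        j = rng.choice(k, p=weights)
        members = X_min[labels == j]
        a, b = members[rng.integers(len(members), size=2)]
        new_points.append(a + rng.random() * (b - a))  # stay between members
    return np.array(new_points)

rng = np.random.default_rng(1)
X_min = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (5, 2))])
synth = com_style_oversample(X_min, k=2, n_new=20)
```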

Citations: 0
Combining Semi-supervised Clustering and Classification Under a Generalized Framework
IF 2.0 · CAS Q4 (Computer Science) · JCR Q2 (Mathematics, Interdisciplinary Applications) · Pub Date: 2024-08-13 · DOI: 10.1007/s00357-024-09489-9
Zhen Jiang, Lingyun Zhao, Yu Lu

Most machine learning algorithms rely on a sufficient amount of labeled data to train a reliable classifier. However, labeling data is often costly and time-consuming, while unlabeled data can be readily accessible. Therefore, learning from both labeled and unlabeled data has become a topic of great interest. Inspired by the co-training algorithm, we present a learning framework called CSCC, which combines semi-supervised clustering and classification to learn from both labeled and unlabeled data. Unlike existing co-training-style methods that construct diverse classifiers to learn from each other, CSCC leverages the diversity between semi-supervised clustering and classification models to achieve mutual enhancement. Existing classification algorithms can be easily adapted to CSCC, allowing them to generalize from only a few labeled examples. In particular, to bridge the gap between class information and clustering, we propose a semi-supervised hierarchical clustering algorithm that utilizes labeled data to guide the cluster-splitting process. Within the CSCC framework, we introduce two loss functions to supervise the iterative updating of the semi-supervised clustering and classification models, respectively. Extensive experiments conducted on a variety of benchmark datasets validate the superiority of CSCC over other state-of-the-art methods.
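A heavily simplified, single-round sketch of the mutual-enhancement idea: a clustering model seeded by the labeled data pseudo-labels the unlabeled points, and a classifier is then retrained on the union. All helper names are hypothetical; the paper's hierarchical clustering, iterative updates, and twin loss functions are not reproduced.

```python
import numpy as np

def seeded_cluster_labels(X, X_lab, y_lab):
    """Semi-supervised clustering stand-in: centroids are seeded from the
    labeled data and every point joins its nearest centroid."""
    classes = np.unique(y_lab)
    cent = np.vstack([X_lab[y_lab == c].mean(axis=0) for c in classes])
    return classes[np.argmin(((X[:, None] - cent) ** 2).sum(-1), axis=1)]

def cscc_round(X_lab, y_lab, X_unlab):
    """One pseudo-label exchange: the clustering side labels the unlabeled
    data, then the classifier (nearest class mean) retrains on everything."""
    pseudo = seeded_cluster_labels(X_unlab, X_lab, y_lab)
    X_all = np.vstack([X_lab, X_unlab])
    y_all = np.concatenate([y_lab, pseudo])
    classes = np.unique(y_all)
    means = np.vstack([X_all[y_all == c].mean(axis=0) for c in classes])
    def predict(Z):
        return classes[np.argmin(((Z[:, None] - means) ** 2).sum(-1), axis=1)]
    return predict

rng = np.random.default_rng(2)
X0, X1 = rng.normal(0, 0.4, (50, 2)), rng.normal(3, 0.4, (50, 2))
X_lab = np.vstack([X0[:3], X1[:3]])           # only 3 labels per class
y_lab = np.array([0] * 3 + [1] * 3)
X_unlab = np.vstack([X0[3:], X1[3:]])
predict = cscc_round(X_lab, y_lab, X_unlab)
acc = float(np.mean(predict(X_unlab) == np.array([0] * 47 + [1] * 47)))
```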

Citations: 0
Slope Stability Classification Model Based on Single-Valued Neutrosophic Matrix Energy and Its Application Under a Single-Valued Neutrosophic Matrix Scenario
IF 2.0 · CAS Q4 (Computer Science) · JCR Q2 (Mathematics, Interdisciplinary Applications) · Pub Date: 2024-07-23 · DOI: 10.1007/s00357-024-09487-x
Jun Ye, Kaiqian Du, Shigui Du, Rui Yong

Although matrix energy (ME) offers an expressive summary of collective information, no ME-based classification method has been investigated in the existing literature, which reflects a research gap in the matrix scenario. Therefore, the purpose of this paper is to propose a slope stability classification model based on single-valued neutrosophic matrix (SVNM) energy to fill this gap in slope stability classification analysis with uncertain and inconsistent information. In this study, we first present the SVNM and define the SVNM energy based on true, uncertain, and false MEs. Then, using a neutrosophication technique based on true, false, and uncertain Gaussian membership functions, the multiple sampled measurements of the stability-affecting factors for each slope are transformed into an SVNM. Next, a slope stability classification model based on the SVNM energy and a score function is developed to perform the classification analysis under the full SVNM scenario of both the affecting-factor weights and the affecting factors of slope stability. Finally, the developed classification model is applied to the classification analysis of 50 slope samples collected from different areas of Zhejiang province in China as a case study to verify its rationality and accuracy under the SVNM scenario. The accuracy of the classification results for the 50 slope samples is 100%.
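Matrix energy itself is easy to state: the sum of the absolute values of a matrix's eigenvalues. A minimal sketch, assuming the standard graph-energy definition and illustrative component matrices (the paper builds its SVNM energy from the energies of the truth, indeterminacy, and falsity components):

```python
import numpy as np

def matrix_energy(M):
    """Energy of a matrix: sum of the absolute values of its eigenvalues
    (the usual graph-theoretic definition)."""
    return float(np.sum(np.abs(np.linalg.eigvals(M))))

# A single-valued neutrosophic matrix stores three component matrices:
# truth T, indeterminacy I, falsity F, each with entries in [0, 1].
# The values below are illustrative, not data from the paper.
T = np.array([[0.9, 0.2], [0.2, 0.8]])
I = np.array([[0.1, 0.3], [0.3, 0.1]])
F = np.array([[0.0, 0.5], [0.5, 0.1]])
energies = (matrix_energy(T), matrix_energy(I), matrix_energy(F))
```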

Citations: 0
An Effective Crow Search Algorithm and Its Application in Data Clustering
IF 2.0 · CAS Q4 (Computer Science) · JCR Q2 (Mathematics, Interdisciplinary Applications) · Pub Date: 2024-07-23 · DOI: 10.1007/s00357-024-09486-y
Rajesh Ranjan, Jitender Kumar Chhabra

In today’s data-centric world, the significance of generated data has increased manifold. Clustering data into similar groups is one of the most active research areas among data practices. Many clustering algorithms have been proposed. Apart from the traditional algorithms, researchers worldwide have successfully employed metaheuristic approaches for clustering. The crow search algorithm (CSA) is a recently introduced swarm-based algorithm that imitates the behavior of crows. An effective crow search algorithm (ECSA) is proposed in the present work, which dynamically tunes its parameters to sustain the search balance and performs opposition-based random initialization. The ECSA is evaluated on the CEC2019 benchmark functions and simulated for data clustering tasks, compared with well-known metaheuristic approaches and the popular partition-based K-means algorithm on benchmark datasets. The results reveal that the ECSA outperforms the other algorithms in terms of external cluster quality metrics and convergence rate.
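For reference, the baseline CSA that ECSA builds on can be sketched in a few lines. This version keeps the flight length `fl` and awareness probability `ap` fixed and omits the opposition-based initialization, so it is the plain algorithm, not the ECSA itself.

```python
import numpy as np

def crow_search(f, dim=2, n_crows=10, iters=200, fl=2.0, ap=0.1,
                lo=-5.0, hi=5.0, seed=0):
    """Basic crow search algorithm minimizing f over [lo, hi]^dim."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(lo, hi, (n_crows, dim))   # current positions
    mem = x.copy()                            # each crow's best-known spot
    mem_f = np.array([f(p) for p in mem])
    for _ in range(iters):
        for i in range(n_crows):
            j = rng.integers(n_crows)         # crow i follows crow j
            if rng.random() >= ap:            # j unaware: move toward m_j
                x[i] = x[i] + rng.random() * fl * (mem[j] - x[i])
            else:                             # j aware: fly somewhere random
                x[i] = rng.uniform(lo, hi, dim)
            x[i] = np.clip(x[i], lo, hi)
            fx = f(x[i])
            if fx < mem_f[i]:                 # memory can only improve
                mem[i], mem_f[i] = x[i].copy(), fx
    return mem[np.argmin(mem_f)], float(mem_f.min())

sphere = lambda p: float(np.sum(p ** 2))
best_x, best_f = crow_search(sphere)
```

Because the memory update is monotone, the best fitness never worsens over iterations; this is the invariant a clustering objective (e.g., within-cluster sum of squares over candidate centroids) would inherit.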

Citations: 0
Flexible Clustering with a Sparse Mixture of Generalized Hyperbolic Distributions
IF 2.0 · CAS Q4 (Computer Science) · JCR Q2 (Mathematics, Interdisciplinary Applications) · Pub Date: 2024-07-12 · DOI: 10.1007/s00357-024-09479-x
Alexa A. Sochaniwsky, Michael P. B. Gallaugher, Yang Tang, Paul D. McNicholas

Robust clustering of high-dimensional data is an important topic because clusters in real datasets are often heavy-tailed and/or asymmetric. Traditional approaches to model-based clustering often fail for high-dimensional data, e.g., due to the number of free covariance parameters. A parametrization of the component scale matrices for the mixture of generalized hyperbolic distributions is proposed. This parameterization includes a penalty term in the likelihood. An analytically feasible expectation-maximization algorithm is developed by placing a gamma-lasso penalty constraining the concentration matrix. The proposed methodology is investigated through simulation studies and illustrated using two real datasets.
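Schematically, a penalized objective of this kind has the following generic lasso-type shape, where f_GH is the generalized hyperbolic density and Ω_g = (ω_{g,jk}) is the concentration (inverse scale) matrix of component g; the exact gamma-lasso weighting used in the paper may differ:

```latex
\ell_{\mathrm{pen}}(\boldsymbol{\theta})
  = \sum_{i=1}^{n} \log \sum_{g=1}^{G} \pi_g \,
      f_{\mathrm{GH}}\!\left(\mathbf{x}_i \mid \boldsymbol{\theta}_g\right)
  \;-\; \lambda \sum_{g=1}^{G} \sum_{j \ne k} \left|\omega_{g,jk}\right|
```

Penalizing the off-diagonal entries of each Ω_g shrinks weak conditional dependencies to zero, which is what keeps the number of free covariance parameters manageable in high dimensions.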

Citations: 0
Clustering with Minimum Spanning Trees: How Good Can It Be?
IF 2.0 · CAS Q4 (Computer Science) · JCR Q2 (Mathematics, Interdisciplinary Applications) · Pub Date: 2024-07-06 · DOI: 10.1007/s00357-024-09483-1
Marek Gagolewski, Anna Cena, Maciej Bartoszuk, Łukasz Brzozowski

Minimum spanning trees (MSTs) provide a convenient representation of datasets in numerous pattern recognition activities. Moreover, they are relatively fast to compute. In this paper, we quantify the extent to which they are meaningful in low-dimensional partitional data clustering tasks. By identifying the upper bounds for the agreement between the best (oracle) algorithm and the expert labels from a large battery of benchmark data, we discover that MST methods can be very competitive. Next, we review, study, extend, and generalise a few existing, state-of-the-art MST-based partitioning schemes. This leads to some new noteworthy approaches. Overall, the Genie and the information-theoretic methods often outperform the non-MST algorithms such as K-means, Gaussian mixtures, spectral clustering, Birch, density-based, and classical hierarchical agglomerative procedures. Nevertheless, we identify that there is still some room for improvement, and thus the development of novel algorithms is encouraged.
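The classic MST baseline these schemes start from is easy to state: build the MST of the complete Euclidean graph, then cut its k−1 heaviest edges; the connected components are the clusters. A minimal numpy sketch (Prim's algorithm plus union-find; the Genie and information-theoretic variants studied in the paper are not reproduced):

```python
import numpy as np

def mst_edges(X):
    """Prim's algorithm on the complete Euclidean graph; returns MST edges
    as (u, v, weight) triples."""
    n = len(X)
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
    in_tree = np.zeros(n, bool)
    in_tree[0] = True
    best = D[0].copy()                  # cheapest link from each node to tree
    parent = np.zeros(n, int)
    edges = []
    for _ in range(n - 1):
        cand = np.where(in_tree, np.inf, best)
        v = int(np.argmin(cand))
        edges.append((int(parent[v]), v, float(cand[v])))
        in_tree[v] = True
        closer = D[v] < best
        parent[closer] = v
        best[closer] = D[v][closer]
    return edges

def mst_clusters(X, k):
    """Cut the k-1 heaviest MST edges; components become cluster labels."""
    keep = sorted(mst_edges(X), key=lambda e: e[2])[:len(X) - k]
    comp = list(range(len(X)))          # union-find with path halving
    def find(a):
        while comp[a] != a:
            comp[a] = comp[comp[a]]
            a = comp[a]
        return a
    for a, b, _ in keep:
        comp[find(a)] = find(b)
    roots = [find(i) for i in range(len(X))]
    _, labels = np.unique(roots, return_inverse=True)
    return labels

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(4, 0.3, (20, 2))])
labels = mst_clusters(X, k=2)
```

On well-separated blobs the heaviest MST edge is the bridge between them, so the cut recovers the partition; the paper's point is precisely how far beyond this single-linkage-style baseline MST methods can be pushed.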

Citations: 0
A New Matrix Feature Selection Strategy in Machine Learning Models for Certain Krylov Solver Prediction
IF 2.0 · CAS Q4 (Computer Science) · JCR Q2 (Mathematics, Interdisciplinary Applications) · Pub Date: 2024-07-06 · DOI: 10.1007/s00357-024-09484-0
Hai-Bing Sun, Yan-Fei Jing, Xiao-Wen Xu

Numerical simulation processes in scientific and engineering applications require efficient solutions of large sparse linear systems, and variants of Krylov subspace solvers with various preconditioning techniques have been developed. However, finding a high-performance Krylov solver in a candidate solver set for a given linear system by trial and error is time-consuming for practitioners. It is therefore preferable to select an efficient solver intelligently from a solver set rather than exploratively applying every solver to the linear system. One promising direction for solver selection is to apply machine learning methods to construct a mapping from matrix features to the candidate solvers. However, some matrix features are quite difficult to compute. In this paper, we design a new matrix-feature selection strategy to reduce computing cost and then employ the selected features to construct a machine learning classifier that predicts an appropriate solver for a given linear system. Numerical experiments on two attractive GMRES-type solvers for linear systems from the University of Florida Sparse Matrix Collection and Matrix Market verify the efficiency of our strategy, which not only reduces the time for obtaining features and constructing the classifier but also maintains more than 90% prediction accuracy.
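To make the cost argument concrete, here are a few structural features that are cheap to extract from a matrix, of the sort such a classifier could consume. The specific features are an illustrative assumption, not the selection produced by the paper's strategy.

```python
import numpy as np

def cheap_matrix_features(A):
    """A few inexpensive structural features of a (dense-stored) matrix:
    size, density, how far it is from symmetric, and the fraction of rows
    that are diagonally dominant."""
    n = A.shape[0]
    nnz = int(np.count_nonzero(A))
    sym_gap = float(np.abs(A - A.T).sum() / max(np.abs(A).sum(), 1e-30))
    off = np.abs(A).sum(axis=1) - np.abs(np.diag(A))
    diag_dom = float(np.mean(np.abs(np.diag(A)) >= off))
    return {"n": n, "density": nnz / (n * n),
            "symmetry_gap": sym_gap, "diag_dominant_rows": diag_dom}

# 1-D Laplacian-style test matrix: symmetric and diagonally dominant
A = np.array([[4.0, -1.0, 0.0],
              [-1.0, 4.0, -1.0],
              [0.0, -1.0, 4.0]])
feats = cheap_matrix_features(A)
```

A feature vector like this would feed any off-the-shelf classifier mapping matrices to the solver expected to perform best.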

Citations: 0
Cluster Validation Based on Fisher’s Linear Discriminant Analysis
IF 2, CAS Zone 4 (Computer Science), Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS, Pub Date: 2024-07-04, DOI: 10.1007/s00357-024-09481-3
Fabian Kächele, Nora Schneider

Cluster analysis aims to find meaningful groups, called clusters, in data. The objects within a cluster should be similar to each other and dissimilar to objects from other clusters. The fundamental question is whether the clusters found are “valid clusters” or not. Existing cluster validity indices are computation-intensive, make assumptions about the underlying cluster structure, or cannot detect the absence of clusters. Thus, we present a new cluster validation framework to assess the validity of a clustering and determine the underlying number of clusters k*. Within the framework, we introduce a new merge criterion that analyzes the data in a one-dimensional projection maximizing the ratio of between-cluster variance to within-cluster variance in the clusters. Other local methods can also be applied as a merge criterion within the framework. Experiments on synthetic and real-world data sets show promising results for both the overall framework and the introduced merge criterion.
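A minimal sketch of the Fisher-style idea behind such a merge criterion (an illustration under simplifying assumptions, not the paper's exact criterion): project two candidate clusters onto the line through their means and compute the ratio of between-cluster to within-cluster variance in that one-dimensional view. A large ratio indicates well-separated clusters that should not be merged.

```python
# Illustrative sketch: Fisher-style 1-D projection and variance ratio
# for two candidate clusters (lists of points of equal dimension).

def mean(vs):
    n = len(vs)
    return [sum(v[i] for v in vs) / n for i in range(len(vs[0]))]

def project(v, d):
    # scalar projection of point v onto direction d (dot product)
    return sum(a * b for a, b in zip(v, d))

def var(xs, m):
    return sum((x - m) ** 2 for x in xs) / len(xs)

def variance_ratio(c1, c2):
    """Between-/within-cluster variance ratio after projecting both
    clusters onto the line connecting their means."""
    m1, m2 = mean(c1), mean(c2)
    d = [b - a for a, b in zip(m1, m2)]        # Fisher-style direction
    p1 = [project(v, d) for v in c1]
    p2 = [project(v, d) for v in c2]
    n1, n2 = len(p1), len(p2)
    mu1, mu2 = sum(p1) / n1, sum(p2) / n2
    mu = (sum(p1) + sum(p2)) / (n1 + n2)       # overall projected mean
    within = (n1 * var(p1, mu1) + n2 * var(p2, mu2)) / (n1 + n2)
    between = (n1 * (mu1 - mu) ** 2 + n2 * (mu2 - mu) ** 2) / (n1 + n2)
    return between / within

well_sep = variance_ratio([[0.0, 0.1], [0.2, -0.1]], [[5.0, 0.0], [5.2, 0.2]])
overlap  = variance_ratio([[0.0, 0.0], [1.0, 1.0]], [[0.5, 0.5], [1.5, 1.5]])
print(well_sep > overlap)  # True: a large ratio argues against merging
```

In a full validation framework, pairs with a low ratio would be merged and the procedure repeated, with the surviving number of clusters serving as the estimate of k*.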

Citations: 0