How to Measure the Researcher Impact with the Aid of its Impactable Area: A Concrete Approach Using Distance Geometry
Pub Date : 2024-08-26 DOI: 10.1007/s00357-024-09490-2
Beniamino Cappelletti-Montano, Gianmarco Cherchi, Benedetto Manca, Stefano Montaldo, Monica Musio
Assuming that the subject of each scientific publication can be identified by one or more classification entities, we address the problem of determining a similarity function (distance) between classification entities based on how often two classification entities are used in the same publication. This similarity function is then used to obtain a representation of the classification entities as points of a Euclidean space of suitable dimension by means of optimization and dimensionality-reduction algorithms. This procedure also allows us to represent researchers as points in the same Euclidean space and to determine the distance between researchers according to their scientific production. As a case study, we consider as classification entities the codes of the American Mathematical Society Classification System.
Multi-task Support Vector Machine Classifier with Generalized Huber Loss
Pub Date : 2024-08-23 DOI: 10.1007/s00357-024-09488-w
Qi Liu, Wenxin Zhu, Zhengming Dai, Zhihong Ma
Compared to single-task learning (STL), multi-task learning (MTL) achieves better generalization by exploiting the domain-specific information implicit in the training signals of several related tasks. The adaptation of MTL to support vector machines (SVMs) is a notably successful example. Inspired by the recently published generalized Huber loss SVM (GHSVM) and regularized multi-task learning (RMTL), we propose a novel generalized Huber loss multi-task support vector machine for binary classification, covering both linear and non-linear cases, named MTL-GHSVM. The new method extends GHSVM from single-task to multi-task learning, and the application of the Huber loss to MTL-SVM is, to the best of our knowledge, novel. The proposed method has two main advantages. First, compared with hinge-loss SVMs and GHSVM, MTL-GHSVM achieves better generalization thanks to the differentiable generalized Huber loss. Second, it uses functional iteration to find the optimal solution and does not need to solve a quadratic programming problem (QPP), which significantly reduces the computational cost. Numerical experiments conducted on fifteen real datasets demonstrate the effectiveness of the proposed multi-task classification algorithm compared with state-of-the-art algorithms.
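A minimal single-task sketch of the key ingredient, replacing the non-differentiable hinge by a smooth Huber-style surrogate so that the model can be fitted by simple iteration rather than a QPP, is given below. The "Huberized" hinge used here is a standard stand-in, not the paper's generalized Huber loss, and the multi-task coupling of MTL-GHSVM is not shown.

```python
import numpy as np

def huberized_hinge_grad(z, delta=0.5):
    """Gradient (w.r.t. the margin z = y * f(x)) of the smoothed
    'Huberized' hinge loss, a differentiable surrogate of the hinge."""
    g = np.zeros_like(z)
    mid = (z < 1) & (z >= 1 - delta)
    g[mid] = -(1 - z[mid]) / delta        # quadratic zone near the margin
    g[z < 1 - delta] = -1.0               # linear zone for large violations
    return g

def fit_linear_svm(X, y, C=1.0, delta=0.5, lr=0.1, iters=500):
    """Gradient-descent fit of (w, b); no quadratic program is solved."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        z = y * (X @ w + b)
        g = huberized_hinge_grad(z, delta)
        w -= lr * (w + C * (X.T @ (g * y)) / n)   # regularizer + loss gradient
        b -= lr * C * np.mean(g * y)
    return w, b

# Toy usage with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)
w, b = fit_linear_svm(X, y)
print(np.mean(np.sign(X @ w + b) == y))   # training accuracy
```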
{"title":"Multi-task Support Vector Machine Classifier with Generalized Huber Loss","authors":"Qi Liu, Wenxin Zhu, Zhengming Dai, Zhihong Ma","doi":"10.1007/s00357-024-09488-w","DOIUrl":"https://doi.org/10.1007/s00357-024-09488-w","url":null,"abstract":"<p>Compared to single-task learning (STL), multi-task learning (MTL) achieves a better generalization by exploiting domain-specific information implicit in the training signals of several related tasks. The adaptation of MTL to support vector machines (SVMs) is a rather successful example. Inspired by the recently published generalized Huber loss SVM (GHSVM) and regularized multi-task learning (RMTL), we propose a novel generalized Huber loss multi-task support vector machine including linear and non-linear cases for binary classification, named as MTL-GHSVM. The new method extends the GHSVM from single-task to multi-task learning, and the application of Huber loss to MTL-SVM is innovative to the best of our knowledge. The proposed method has two main advantages: on the one hand, compared with SVMs with hinge loss and GHSVM, our MTL-GHSVM using the differentiable generalized Huber loss has better generalization performance; on the other hand, it adopts functional iteration to find the optimal solution, and does not need to solve a quadratic programming problem (QPP), which can significantly reduce the computational cost. Numerical experiments have been conducted on fifteen real datasets, and the results demonstrate the effectiveness of the proposed multi-task classification algorithm compared with the state-of-the-art algorithms.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"166 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Clustering-Based Oversampling Algorithm for Multi-class Imbalance Learning
Pub Date : 2024-08-22 DOI: 10.1007/s00357-024-09491-1
Haixia Zhao, Jian Wu
Multi-class imbalanced data learning faces many challenges. Its complex structural characteristics cause severe intra-class imbalance or overgeneralization in most solution strategies, which negatively affects learning. This paper proposes a clustering-based oversampling algorithm (COM) to handle multi-class imbalance learning. To avoid losing important information, COM clusters the minority class based on the structural characteristics of the instances, carefully portraying rare instances and outliers by assigning a sampling weight to each cluster. Clusters with high densities are given low weights, and oversampling is then performed within clusters to avoid overgeneralization. COM effectively avoids intra-class imbalance because low-density clusters are more likely than high-density ones to be selected for synthesizing instances. Our study uses the UCI and KEEL imbalanced datasets to demonstrate the effectiveness and stability of the proposed method.
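The ingredients named above (cluster the minority class, weight low-density clusters more heavily, and synthesize new points only within a cluster) can be sketched as follows. The k-means step, the density proxy, and the SMOTE-style interpolation are illustrative assumptions rather than the paper's exact algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def com_style_oversample(X_min, n_new, k=3, rng=None):
    """Oversample a minority class: cluster it, give low-density clusters
    higher sampling weight, and synthesize points *within* clusters only,
    which avoids overgeneralization across cluster boundaries."""
    if rng is None:
        rng = np.random.default_rng(0)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_min)
    clusters = [X_min[labels == c] for c in range(k)]
    # Density proxy: points per unit spread; low density -> high weight.
    dens = np.array([len(c) / (np.mean(np.var(c, axis=0)) + 1e-9) for c in clusters])
    w = 1.0 / dens
    w /= w.sum()
    out = []
    for c, n_c in zip(clusters, rng.multinomial(n_new, w)):
        for _ in range(n_c):
            a, b = c[rng.integers(len(c), size=2)]
            out.append(a + rng.random() * (b - a))   # interpolate inside the cluster
    return np.array(out)
```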
{"title":"Clustering-Based Oversampling Algorithm for Multi-class Imbalance Learning","authors":"Haixia Zhao, Jian Wu","doi":"10.1007/s00357-024-09491-1","DOIUrl":"https://doi.org/10.1007/s00357-024-09491-1","url":null,"abstract":"<p>Multi-class imbalanced data learning faces many challenges. Its complex structural characteristics cause severe intra-class imbalance or overgeneralization in most solution strategies. This negatively affects data learning. This paper proposes a clustering-based oversampling algorithm (COM) to handle multi-class imbalance learning. In order to avoid the loss of important information, COM clusters the minority class based on the structural characteristics of the instances, among which rare instances and outliers are carefully portrayed through assigning a sampling weight to each of the clusters. Clusters with high densities are given low weights, and then, oversampling is performed within clusters to avoid overgeneralization. COM avoids intra-class imbalance effectively because low-density clusters are more likely than high-density ones to be selected to synthesize instances. Our study used the UCI and KEEL imbalanced datasets to demonstrate the effectiveness and stability of the proposed method.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"17 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218985","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Combining Semi-supervised Clustering and Classification Under a Generalized Framework
Pub Date : 2024-08-13 DOI: 10.1007/s00357-024-09489-9
Zhen Jiang, Lingyun Zhao, Yu Lu
Most machine learning algorithms rely on a sufficient amount of labeled data to train a reliable classifier. However, labeling data is often costly and time-consuming, while unlabeled data are readily accessible. Learning from both labeled and unlabeled data has therefore become a topic of great interest. Inspired by the co-training algorithm, we present a learning framework called CSCC, which combines semi-supervised clustering and classification to learn from both labeled and unlabeled data. Unlike existing co-training-style methods that construct diverse classifiers to learn from each other, CSCC leverages the diversity between semi-supervised clustering and classification models to achieve mutual enhancement. Existing classification algorithms can be easily adapted to CSCC, allowing them to generalize from a small amount of labeled data. In particular, to bridge the gap between class information and clustering, we propose a semi-supervised hierarchical clustering algorithm that uses labeled data to guide the cluster-splitting process. Within the CSCC framework, we introduce two loss functions to supervise the iterative updating of the semi-supervised clustering and classification models, respectively. Extensive experiments conducted on a variety of benchmark datasets validate the superiority of CSCC over other state-of-the-art methods.
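One plausible reading of the label-guided cluster-splitting step is sketched below: a cluster keeps being bisected while it still contains labeled points from more than one class. The bisecting k-means and the stopping rule are assumptions for illustration; the paper's exact algorithm and its two loss functions are not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

def label_guided_splitting(X, y, min_size=5):
    """Top-down hierarchical clustering in which labeled data guide the
    splits: a cluster is bisected while it mixes more than one known class.
    y holds class ids for labeled points and -1 for unlabeled ones."""
    clusters, queue = [], [np.arange(len(X))]
    while queue:
        idx = queue.pop()
        known = np.unique(y[idx][y[idx] != -1])
        if len(known) <= 1 or len(idx) < 2 * min_size:
            clusters.append(idx)               # pure (or too small): leaf
            continue
        half = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        queue += [idx[half == 0], idx[half == 1]]
    return clusters
```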
{"title":"Combining Semi-supervised Clustering and Classification Under a Generalized Framework","authors":"Zhen Jiang, Lingyun Zhao, Yu Lu","doi":"10.1007/s00357-024-09489-9","DOIUrl":"https://doi.org/10.1007/s00357-024-09489-9","url":null,"abstract":"<p>Most machine learning algorithms rely on having a sufficient amount of labeled data to train a reliable classifier. However, labeling data is often costly and time-consuming, while unlabeled data can be readily accessible. Therefore, learning from both labeled and unlabeled data has become a hot topic of interest. Inspired by the co-training algorithm, we present a learning framework called CSCC, which combines semi-supervised clustering and classification to learn from both labeled and unlabeled data. Unlike existing co-training style methods that construct diverse classifiers to learn from each other, CSCC leverages the diversity between semi-supervised clustering and classification models to achieve mutual enhancement. Existing classification algorithms can be easily adapted to CSCC, allowing them to generalize from a few labeled data. Especially, in order to bridge the gap between class information and clustering, we propose a semi-supervised hierarchical clustering algorithm that utilizes labeled data to guide the process of cluster-splitting. Within the CSCC framework, we introduce two loss functions to supervise the iterative updating of the semi-supervised clustering and classification models, respectively. Extensive experiments conducted on a variety of benchmark datasets validate the superiority of CSCC over other state-of-the-art methods.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"13 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Slope Stability Classification Model Based on Single-Valued Neutrosophic Matrix Energy and Its Application Under a Single-Valued Neutrosophic Matrix Scenario
Pub Date : 2024-07-23 DOI: 10.1007/s00357-024-09487-x
Jun Ye, Kaiqian Du, Shigui Du, Rui Yong
Although matrix energy (ME) carries the expressive merit of collective information, no classification method based on ME has been investigated in the existing literature, leaving a research gap in the matrix scenario. The purpose of this paper is therefore to propose a slope stability classification model based on single-valued neutrosophic matrix (SVNM) energy, addressing slope stability classification analysis with uncertain and inconsistent information. In this study, we first present the SVNM and define the SVNM energy in terms of true, uncertain, and false MEs. Then, using a neutrosophication technique based on true, false, and uncertain Gaussian membership functions, the multiple sampling data of the stability-affecting factors for each slope are transformed into an SVNM. Next, a slope stability classification model based on the SVNM energy and a score function is developed to perform classification analysis under the full SVNM scenario, covering both the weights and the values of the stability-affecting factors. Finally, the developed classification model is applied, as a case study, to the classification analysis of 50 slope samples collected from different areas of Zhejiang province in China to verify its rationality and accuracy under the SVNM scenario. The classification accuracy on the 50 slope samples is 100%.
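For orientation, matrix energy is conventionally the sum of a matrix's singular values (for a symmetric matrix, the sum of the absolute eigenvalues, as in graph energy). The sketch below computes such an energy for the truth, indeterminacy, and falsity components of an SVNM and combines them into a toy score; the paper's SVNM energy and score function may be defined differently, so the weights here are assumptions.

```python
import numpy as np

def matrix_energy(M):
    """Energy of a matrix: the sum of its singular values (for a symmetric
    matrix this equals the sum of the absolute eigenvalues)."""
    return np.linalg.svd(M, compute_uv=False).sum()

def svnm_score(T, I, F, w=(1.0, 0.5, 0.0)):
    """Toy score combining the energies of the truth (T), indeterminacy (I),
    and falsity (F) matrices of an SVNM. The weights w are illustrative
    assumptions, not the paper's score function."""
    e = np.array([matrix_energy(T), matrix_energy(I), matrix_energy(F)])
    return float(np.dot(w, e) / e.sum())

# Toy SVNM for one slope sample: rows = sampling runs, columns = affecting factors.
T = np.array([[0.8, 0.7], [0.9, 0.6]])   # truth memberships
I = np.array([[0.1, 0.2], [0.1, 0.3]])   # indeterminacy memberships
F = np.array([[0.1, 0.1], [0.0, 0.1]])   # falsity memberships
print(svnm_score(T, I, F))
```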
{"title":"Slope Stability Classification Model Based on Single-Valued Neutrosophic Matrix Energy and Its Application Under a Single-Valued Neutrosophic Matrix Scenario","authors":"Jun Ye, Kaiqian Du, Shigui Du, Rui Yong","doi":"10.1007/s00357-024-09487-x","DOIUrl":"https://doi.org/10.1007/s00357-024-09487-x","url":null,"abstract":"<p>Since matrix energy (ME) implies the expressive merit of collective information, a classification method based on ME has not been investigated in the existing literature, which reflects its research gap in a matrix scenario. Therefore, the purpose of this paper is to propose a slope stability classification model based on the single-valued neutrosophic matrix (SVNM) energy to solve the current research gap in slope stability classification analysis with uncertain and inconsistent information. In this study, we first present SVNM and define the SVNM energy based on true, uncertain, and false MEs. Then, using a neutrosophication technique based on true, false, and uncertain Gaussian membership functions, the multiple sampling data of the stability affecting factors for each slope are transformed into SVNM. Next, a slope stability classification model based on the SVNM energy and score function is developed to solve the slope stability classification analysis under the full SVNM scenario of both the affecting factor weights and the affecting factors of slope stability. Finally, the developed classification model is applied to the classification analysis of 50 slope samples collected from different areas of Zhejiang province in China as a case study to verify its rationality and accuracy under the SVNM scenario. The accuracy of the classification results for the 50 slope samples is 100%.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"29 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141782264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Effective Crow Search Algorithm and Its Application in Data Clustering
Pub Date : 2024-07-23 DOI: 10.1007/s00357-024-09486-y
Rajesh Ranjan, Jitender Kumar Chhabra
In today’s data-centric world, the significance of generated data has increased manifold. Clustering data into similar groups is one of the most active research areas among data-analysis practices, and numerous algorithms have been proposed for it. Apart from the traditional algorithms, researchers worldwide have successfully employed metaheuristic approaches for clustering. The crow search algorithm (CSA) is a recently introduced swarm-based algorithm that imitates the behavior of crows. The present work proposes an effective crow search algorithm (ECSA) that dynamically tunes its parameters to sustain the search balance and performs opposition-based random initialization. The ECSA is evaluated on the CEC2019 benchmark functions and applied to data clustering tasks on benchmark datasets, in comparison with well-known metaheuristic approaches and the popular partition-based K-means algorithm. The results reveal that the ECSA outperforms the other algorithms in terms of external cluster-quality metrics and convergence rate.
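Opposition-based random initialization, one of the two modifications named above, is straightforward to make concrete: for every random candidate x in [lb, ub], also evaluate its opposite lb + ub - x and keep the better half of the combined pool. The fitness function and population size below are placeholders, and ECSA's dynamic parameter tuning is not shown.

```python
import numpy as np

def opposition_init(fitness, lb, ub, pop_size, rng=None):
    """Opposition-based initialization: generate a random population,
    mirror each candidate to its opposite, and keep the best pop_size
    individuals from the combined pool."""
    if rng is None:
        rng = np.random.default_rng(0)
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    P = lb + rng.random((pop_size, lb.size)) * (ub - lb)  # random candidates
    O = lb + ub - P                                       # their opposites
    pool = np.vstack([P, O])
    best = np.argsort([fitness(x) for x in pool])[:pop_size]
    return pool[best]

# Toy usage: minimize the sphere function in 5 dimensions.
init = opposition_init(lambda x: np.sum(x ** 2), [-5] * 5, [5] * 5, pop_size=20)
print(init.shape)   # (20, 5)
```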
{"title":"An Effective Crow Search Algorithm and Its Application in Data Clustering","authors":"Rajesh Ranjan, Jitender Kumar Chhabra","doi":"10.1007/s00357-024-09486-y","DOIUrl":"https://doi.org/10.1007/s00357-024-09486-y","url":null,"abstract":"<p>In today’s data-centric world, the significance of generated data has increased manifold. Clustering the data into a similar group is one of the dynamic research areas among other data practices. Several algorithms’ proposals exist for clustering. Apart from the traditional algorithms, researchers worldwide have successfully employed some metaheuristic approaches for clustering. The crow search algorithm (CSA) is a recently introduced swarm-based algorithm that imitates the performance of the crow. An effective crow search algorithm (ECSA) has been proposed in the present work, which dynamically attunes its parameter to sustain the search balance and perform an oppositional-based random initialization. The ECSA is evaluated over CEC2019 Benchmark Functions and simulated for data clustering tasks compared with well-known metaheuristic approaches and famous partition-based K-means algorithm over benchmark datasets. The results reveal that the ECSA performs better than other algorithms in the context of external cluster quality metrics and convergence rate.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"95 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141782265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Flexible Clustering with a Sparse Mixture of Generalized Hyperbolic Distributions
Pub Date : 2024-07-12 DOI: 10.1007/s00357-024-09479-x
Alexa A. Sochaniwsky, Michael P. B. Gallaugher, Yang Tang, Paul D. McNicholas
Robust clustering of high-dimensional data is an important topic because clusters in real datasets are often heavy-tailed and/or asymmetric. Traditional approaches to model-based clustering often fail for high-dimensional data, e.g., because of the number of free covariance parameters. A parametrization of the component scale matrices for the mixture of generalized hyperbolic distributions is proposed. This parameterization includes a penalty term in the likelihood. An analytically feasible expectation-maximization algorithm is developed by placing a gamma-lasso penalty constraining the concentration matrix. The proposed methodology is investigated through simulation studies and illustrated using two real datasets.
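In generic form, such a penalized approach maximizes the mixture log-likelihood minus a lasso-type penalty on the component concentration (precision) matrices. The display below shows only this generic shape, with a plain \(\ell_1\) term standing in for the paper's gamma-lasso penalty:

$$
\ell_{\mathrm{pen}}(\boldsymbol{\vartheta})
= \sum_{i=1}^{n} \log \sum_{g=1}^{G} \pi_g \, f\!\left(\mathbf{x}_i \mid \boldsymbol{\theta}_g\right)
\;-\; \sum_{g=1}^{G} \lambda_g \bigl\lVert \boldsymbol{\Omega}_g \bigr\rVert_1 ,
$$

where \(f\) is the generalized hyperbolic density, \(\boldsymbol{\Omega}_g\) are the component concentration matrices, and \(\lambda_g \ge 0\) controls the sparsity.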
{"title":"Flexible Clustering with a Sparse Mixture of Generalized Hyperbolic Distributions","authors":"Alexa A. Sochaniwsky, Michael P. B. Gallaugher, Yang Tang, Paul D. McNicholas","doi":"10.1007/s00357-024-09479-x","DOIUrl":"https://doi.org/10.1007/s00357-024-09479-x","url":null,"abstract":"<p>Robust clustering of high-dimensional data is an important topic because clusters in real datasets are often heavy-tailed and/or asymmetric. Traditional approaches to model-based clustering often fail for high dimensional data, e.g., due to the number of free covariance parameters. A parametrization of the component scale matrices for the mixture of generalized hyperbolic distributions is proposed. This parameterization includes a penalty term in the likelihood. An analytically feasible expectation-maximization algorithm is developed by placing a gamma-lasso penalty constraining the concentration matrix. The proposed methodology is investigated through simulation studies and illustrated using two real datasets.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"33 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141614134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Clustering with Minimum Spanning Trees: How Good Can It Be?
Pub Date : 2024-07-06 DOI: 10.1007/s00357-024-09483-1
Marek Gagolewski, Anna Cena, Maciej Bartoszuk, Łukasz Brzozowski
Minimum spanning trees (MSTs) provide a convenient representation of datasets in numerous pattern recognition activities. Moreover, they are relatively fast to compute. In this paper, we quantify the extent to which they are meaningful in low-dimensional partitional data clustering tasks. By identifying the upper bounds for the agreement between the best (oracle) algorithm and the expert labels from a large battery of benchmark data, we discover that MST methods can be very competitive. Next, we review, study, extend, and generalise a few existing, state-of-the-art MST-based partitioning schemes. This leads to some new noteworthy approaches. Overall, the Genie and the information-theoretic methods often outperform the non-MST algorithms such as K-means, Gaussian mixtures, spectral clustering, Birch, density-based, and classical hierarchical agglomerative procedures. Nevertheless, we identify that there is still some room for improvement, and thus the development of novel algorithms is encouraged.
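The simplest member of this family of partitioning schemes is to cut the heaviest MST edges. The sketch below shows this classic rule (not the Genie or information-theoretic variants studied in the paper) using standard SciPy routines.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial import distance_matrix

def mst_clustering(X, k):
    """Classic MST partitioning: build the minimum spanning tree of the
    pairwise distances, delete the k-1 heaviest edges, and return the
    labels of the k resulting connected components."""
    mst = minimum_spanning_tree(distance_matrix(X, X)).tocoo()
    keep = np.ones(len(mst.data), dtype=bool)
    keep[np.argsort(mst.data)[::-1][:k - 1]] = False   # drop the heaviest edges
    pruned = coo_matrix((mst.data[keep], (mst.row[keep], mst.col[keep])),
                        shape=mst.shape)
    return connected_components(pruned, directed=False)[1]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
print(np.bincount(mst_clustering(X, k=2)))   # two components of 30 points each
```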
{"title":"Clustering with Minimum Spanning Trees: How Good Can It Be?","authors":"Marek Gagolewski, Anna Cena, Maciej Bartoszuk, Łukasz Brzozowski","doi":"10.1007/s00357-024-09483-1","DOIUrl":"https://doi.org/10.1007/s00357-024-09483-1","url":null,"abstract":"<p>Minimum spanning trees (MSTs) provide a convenient representation of datasets in numerous pattern recognition activities. Moreover, they are relatively fast to compute. In this paper, we quantify the extent to which they are meaningful in low-dimensional partitional data clustering tasks. By identifying the upper bounds for the agreement between the best (oracle) algorithm and the expert labels from a large battery of benchmark data, we discover that MST methods can be very competitive. Next, we review, study, extend, and generalise a few existing, state-of-the-art MST-based partitioning schemes. This leads to some new noteworthy approaches. Overall, the Genie and the information-theoretic methods often outperform the non-MST algorithms such as K-means, Gaussian mixtures, spectral clustering, Birch, density-based, and classical hierarchical agglomerative procedures. Nevertheless, we identify that there is still some room for improvement, and thus the development of novel algorithms is encouraged.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"4 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141577897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A New Matrix Feature Selection Strategy in Machine Learning Models for Certain Krylov Solver Prediction
Pub Date : 2024-07-06 DOI: 10.1007/s00357-024-09484-0
Hai-Bing Sun, Yan-Fei Jing, Xiao-Wen Xu
Numerical simulation in scientific and engineering applications requires the efficient solution of large sparse linear systems, and variants of Krylov subspace solvers with various preconditioning techniques have been developed for this purpose. However, finding a high-performance Krylov solver in a candidate set for a given linear system by trial and error is time-consuming for practitioners. It is therefore preferable to select an efficient solver intelligently from the solver set rather than exploratively applying all solvers to the linear system. One promising direction for solver selection is to apply machine learning methods to construct a mapping from matrix features to candidate solvers. However, some matrix features are quite expensive to compute. In this paper, we design a new selection strategy for matrix features that reduces computing cost, and then employ the selected features to construct a machine learning classifier that predicts an appropriate solver for a given linear system. Numerical experiments on two attractive GMRES-type solvers, applied to linear systems from the University of Florida Sparse Matrix Collection and Matrix Market, verify the efficiency of our strategy: it not only reduces the time for obtaining features and constructing the classifier, but also maintains more than 90% prediction accuracy.
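The overall workflow, mapping cheap matrix features to a solver label through a trained classifier, can be sketched as follows. The three features shown (density, symmetry, diagonal dominance) and the random-forest model are illustrative stand-ins; choosing which features are worth their computing cost is precisely the paper's contribution, and the labels in the toy example are dummies, not measured solver timings.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.ensemble import RandomForestClassifier

def cheap_features(A):
    """A few inexpensive structural features of a sparse matrix A.
    Illustrative only; the paper selects its own feature subset."""
    A = A.tocsr()
    n = A.shape[0]
    density = A.nnz / (n * n)
    symmetry = 1.0 - abs(A - A.T).sum() / (abs(A).sum() + 1e-30)
    d = np.abs(A.diagonal())
    offdiag = np.asarray(abs(A).sum(axis=1)).ravel() - d
    diag_dom = np.mean(d >= offdiag)         # fraction of diagonally dominant rows
    return [density, symmetry, diag_dom]

# X: feature vectors of training matrices; y: index of the best solver.
mats = [sp.random(100, 100, density=0.05, random_state=i, format='csr')
        + sp.eye(100) for i in range(40)]
X = np.array([cheap_features(A) for A in mats])
y = np.random.default_rng(0).integers(0, 2, size=len(mats))  # dummy labels
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict(X[:3]))    # predicted solver index for new systems
```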
{"title":"A New Matrix Feature Selection Strategy in Machine Learning Models for Certain Krylov Solver Prediction","authors":"Hai-Bing Sun, Yan-Fei Jing, Xiao-Wen Xu","doi":"10.1007/s00357-024-09484-0","DOIUrl":"https://doi.org/10.1007/s00357-024-09484-0","url":null,"abstract":"<p>Numerical simulation processes in scientific and engineering applications require efficient solutions of large sparse linear systems, and variants of Krylov subspace solvers with various preconditioning techniques have been developed. However, it is time-consuming for practitioners with trial and error to find a high-performance Krylov solver in a candidate solver set for a given linear system. Therefore, it is instructive to select an efficient solver intelligently among a solver set rather than exploratory application of all solvers to solve the linear system. One promising direction of solver selection is to apply machine learning methods to construct a mapping from the matrix features to the candidate solvers. However, the computation of some matrix features is quite difficult. In this paper, we design a new selection strategy of matrix features to reduce computing cost, and then employ the selected features to construct a machine learning classifier to predict an appropriate solver for a given linear system. Numerical experiments on two attractive GMRES-type solvers for solving linear systems from the University of Florida Sparse Matrix Collection and Matrix Market verify the efficiency of our strategy, not only reducing the computing time for obtaining features and construction time of classifier but also keeping more than 90% prediction accuracy.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"30 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141573774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cluster Validation Based on Fisher’s Linear Discriminant Analysis
Pub Date : 2024-07-04 DOI: 10.1007/s00357-024-09481-3
Fabian Kächele, Nora Schneider
Cluster analysis aims to find meaningful groups, called clusters, in data. The objects within a cluster should be similar to each other and dissimilar to objects from other clusters. The fundamental question is whether the clusters found are “valid clusters” or not. Existing cluster validity indices are computation-intensive, make assumptions about the underlying cluster structure, or cannot detect the absence of clusters. We therefore present a new cluster validation framework to assess the validity of a clustering and determine the underlying number of clusters $k^*$. Within the framework, we introduce a new merge criterion that analyzes the data in a one-dimensional projection which maximizes the ratio of between-cluster variance to within-cluster variance in the clusters. Nonetheless, other local methods can be applied as the merge criterion within the framework. Experiments on synthetic and real-world datasets show promising results for both the overall framework and the introduced merge criterion.
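Because Fisher's linear discriminant direction for two groups maximizes exactly the ratio of between-cluster variance to within-cluster variance after projection, the criterion can be sketched compactly. The merge threshold below is an illustrative assumption, not the paper's decision rule.

```python
import numpy as np

def fisher_ratio(A, B):
    """Project two clusters onto Fisher's discriminant direction and
    return the between-cluster / within-cluster variance ratio there."""
    mA, mB = A.mean(axis=0), B.mean(axis=0)
    Sw = np.cov(A.T) * (len(A) - 1) + np.cov(B.T) * (len(B) - 1)  # within-scatter
    w = np.linalg.solve(Sw + 1e-9 * np.eye(len(mA)), mA - mB)     # LDA direction
    a, b = A @ w, B @ w
    m = np.r_[a, b].mean()
    between = len(a) * (a.mean() - m) ** 2 + len(b) * (b.mean() - m) ** 2
    within = a.var() * len(a) + b.var() * len(b)
    return between / within

def should_merge(A, B, threshold=1.0):
    """Merge when the clusters are not well separated in the projection.
    The threshold is an illustrative choice, not the paper's rule."""
    return fisher_ratio(A, B) < threshold

rng = np.random.default_rng(0)
A, B = rng.normal(0, 1, (40, 2)), rng.normal(4, 1, (40, 2))
print(fisher_ratio(A, B), should_merge(A, B))   # large ratio: keep separate
```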
{"title":"Cluster Validation Based on Fisher’s Linear Discriminant Analysis","authors":"Fabian Kächele, Nora Schneider","doi":"10.1007/s00357-024-09481-3","DOIUrl":"https://doi.org/10.1007/s00357-024-09481-3","url":null,"abstract":"<p>Cluster analysis aims to find meaningful groups, called clusters, in data. The objects within a cluster should be similar to each other and dissimilar to objects from other clusters. The fundamental question arising is whether found clusters are “valid clusters” or not. Existing cluster validity indices are computation-intensive, make assumptions about the underlying cluster structure, or cannot detect the absence of clusters. Thus, we present a new cluster validation framework to assess the validity of a clustering and determine the underlying number of clusters <span>(k^*)</span>. Within the framework, we introduce a new merge criterion analyzing the data in a one-dimensional projection, which maximizes the ratio of between-cluster- variance to within-cluster-variance in the clusters. Nonetheless, other local methods can be applied as a merge criterion within the framework. Experiments on synthetic and real-world data sets show promising results for both the overall framework and the introduced merge criterion.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"40 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141549520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}