Clustering large mixed-type data with ordinal variables
Pub Date: 2024-05-27 | DOI: 10.1007/s11634-024-00595-5
Gero Szepannek, Rabea Aschenbruck, Adalbert Wilhelm
One of the most frequently used algorithms for clustering data with both numeric and categorical variables is the k-prototypes algorithm, an extension of the well-known k-means clustering. Gower’s distance is another popular approach for dealing with mixed-type data and is suitable not only for numeric and categorical but also for ordinal variables. In this paper, a modification of the k-prototypes algorithm to Gower’s distance is proposed that ensures convergence. This provides a tool that takes ordinal information into account for clustering and can also be used for large data. A simulation study demonstrates convergence, good clustering results, and small runtimes.
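The assign-update loop is easy to sketch. Below is a minimal NumPy illustration (not the authors’ implementation, and the specific prototype updates that the paper proves convergent may differ): numeric columns are floats, categorical columns are integer codes, and ordinal columns are assumed pre-scaled to ranks in [0, 1].

```python
import numpy as np

def gower_to_prototypes(X_num, X_cat, X_ord, P_num, P_cat, P_ord, num_range):
    # Gower: mean of per-variable dissimilarities, each scaled to [0, 1]
    d_num = np.abs(X_num[:, None, :] - P_num[None, :, :]) / num_range  # numeric
    d_cat = (X_cat[:, None, :] != P_cat[None, :, :]).astype(float)     # mismatch
    d_ord = np.abs(X_ord[:, None, :] - P_ord[None, :, :])              # scaled ranks
    return np.concatenate([d_num, d_cat, d_ord], axis=2).mean(axis=2)  # (n, k)

def kprototypes_gower(X_num, X_cat, X_ord, k, n_iter=25, seed=0):
    rng = np.random.default_rng(seed)
    n = X_num.shape[0]
    num_range = X_num.max(axis=0) - X_num.min(axis=0)
    num_range[num_range == 0] = 1.0                  # guard constant columns
    start = rng.choice(n, size=k, replace=False)
    P_num, P_cat, P_ord = X_num[start].copy(), X_cat[start].copy(), X_ord[start].copy()
    labels = np.full(n, -1)
    for _ in range(n_iter):
        D = gower_to_prototypes(X_num, X_cat, X_ord, P_num, P_cat, P_ord, num_range)
        new = D.argmin(axis=1)
        if (new == labels).all():                    # assignments stable: stop
            break
        labels = new
        for j in range(k):
            m = labels == j
            if not m.any():
                continue
            P_num[j] = X_num[m].mean(axis=0)         # numeric: mean
            P_ord[j] = np.median(X_ord[m], axis=0)   # ordinal ranks: median
            for c in range(X_cat.shape[1]):          # categorical: mode
                vals, cnt = np.unique(X_cat[m, c], return_counts=True)
                P_cat[j, c] = vals[cnt.argmax()]
    return labels
```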
{"title":"Clustering large mixed-type data with ordinal variables","authors":"Gero Szepannek, Rabea Aschenbruck, Adalbert Wilhelm","doi":"10.1007/s11634-024-00595-5","DOIUrl":"https://doi.org/10.1007/s11634-024-00595-5","url":null,"abstract":"<p>One of the most frequently used algorithms for clustering data with both numeric and categorical variables is the k-prototypes algorithm, an extension of the well-known k-means clustering. Gower’s distance denotes another popular approach for dealing with mixed-type data and is suitable not only for numeric and categorical but also for ordinal variables. In the paper a modification of the k-prototypes algorithm to Gower’s distance is proposed that ensures convergence. This provides a tool that allows to take into account ordinal information for clustering and can also be used for large data. A simulation study demonstrates convergence, good clustering results as well as small runtimes.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"46 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141167174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A two-group canonical variate analysis biplot for an optimal display of both means and cases
Pub Date: 2024-05-06 | DOI: 10.1007/s11634-024-00593-7
Niel le Roux, Sugnet Gardner-Lubbe
Canonical variate analysis (CVA) entails a two-sided eigenvalue decomposition. When the number of groups, J, is less than the number of variables, p, at most J-1 eigenvalues are not exactly zero. A CVA biplot is the simultaneous display of two entities: group means as points and variables as calibrated biplot axes. It follows that with two groups the group means can be represented exactly in a one-dimensional biplot, but the individual samples are only approximated. We define a criterion to measure the quality of representing the individual samples in a CVA biplot. Then, for the two-group case, we propose an additional dimension for constructing an optimal two-dimensional CVA biplot. The proposed novel CVA biplot maintains the exact display of group means and biplot axes, while the individual sample points satisfy the optimality criterion in a unique simultaneous display of group means, calibrated biplot axes for the variables, and within-group samples. Although our primary aim is to address two-group CVA, our proposal extends immediately to an optimal three-dimensional biplot for the equally important case of comparing three groups in practice.
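For reference, one standard CVA formulation that makes the rank bound explicit (notation assumed here, not taken from the paper):

```latex
% B = between-group, W = within-group covariance (p x p), with group
% means \bar{x}_j, grand mean \bar{x}, and group sizes n_j:
\[
  B = \sum_{j=1}^{J} n_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^{\top},
  \qquad
  W = \sum_{j=1}^{J} \sum_{i \in G_j} (x_i - \bar{x}_j)(x_i - \bar{x}_j)^{\top}.
\]
% The canonical directions solve the two-sided eigenproblem
\[
  B w = \lambda W w ,
  \qquad
  \operatorname{rank}(B) \le J - 1
  \;\Rightarrow\; \text{at most } J - 1 \text{ nonzero eigenvalues.}
\]
```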
{"title":"A two-group canonical variate analysis biplot for an optimal display of both means and cases","authors":"Niel le Roux, Sugnet Gardner-Lubbe","doi":"10.1007/s11634-024-00593-7","DOIUrl":"https://doi.org/10.1007/s11634-024-00593-7","url":null,"abstract":"<p>Canonical variate analysis (CVA) entails a two-sided eigenvalue decomposition. When the number of groups, <i>J</i>, is less than the number of variables, <i>p</i>, at most <span>(J-1)</span> eigenvalues are not exactly zero. A CVA biplot is the simultaneous display of the two entities: group means as points and variables as calibrated biplot axes. It follows that with two groups the group means can be exactly represented in a one-dimensional biplot but the individual samples are approximated. We define a criterion to measure the quality of representing the individual samples in a CVA biplot. Then, for the two-group case we propose an additional dimension for constructing an optimal two-dimensional CVA biplot. The proposed novel CVA biplot maintains the exact display of group means and biplot axes, but the individual sample points satisfy the optimality criterion in a unique simultaneous display of group means, calibrated biplot axes for the variables, and within group samples. Although our primary aim is to address two-group CVA, our proposal extends immediately to an optimal three-dimensional biplot when encountering the equally important case of comparing three groups in practice.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"15 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140888158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Clustering functional data via variational inference
Pub Date: 2024-04-30 | DOI: 10.1007/s11634-024-00590-w
Chengqian Xian, Camila P. E. de Souza, John Jewell, Ronaldo Dias
Among the various functional data analyses, clustering analysis aims to determine underlying groups of curves in a dataset when there is no information on the group membership of each curve. In this work, we develop a novel variational Bayes (VB) algorithm for clustering and smoothing functional data simultaneously via a B-spline regression mixture model with random intercepts. We employ the deviance information criterion to select the best number of clusters. The proposed VB algorithm is evaluated and compared with other methods (k-means, functional k-means and two other model-based methods) via a simulation study under various scenarios. We apply our proposed methodology to two publicly available datasets. We demonstrate that the proposed VB algorithm achieves satisfactory clustering performance in both simulation and real data analyses.
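As a rough illustration of the underlying idea of a mixture of regressions on a spline-type basis, here is an EM sketch for curves observed on a common grid (EM is swapped in for the paper’s VB algorithm, the random intercepts are omitted, and a truncated power basis stands in for B-splines):

```python
import numpy as np

def trunc_power_basis(t, knots, degree=3):
    # 1, t, ..., t^degree plus (t - kappa)_+^degree per interior knot
    cols = [t**d for d in range(degree + 1)]
    cols += [np.clip(t - kappa, 0.0, None)**degree for kappa in knots]
    return np.column_stack(cols)

def em_curve_clustering(Y, B, K, n_iter=100, seed=0):
    # Y: (n_curves, n_points) curves on a common grid; B: (n_points, q) basis
    rng = np.random.default_rng(seed)
    n, T = Y.shape
    resp = rng.dirichlet(np.ones(K), size=n)          # soft memberships
    for _ in range(n_iter):
        pis, betas, sig2 = [], [], []
        for j in range(K):                            # M-step per cluster
            w = resp[:, j]
            ybar = (w[:, None] * Y).sum(axis=0) / w.sum()    # weighted mean curve
            beta = np.linalg.lstsq(B, ybar, rcond=None)[0]   # cluster coefficients
            resid = Y - B @ beta
            sig2.append((w[:, None] * resid**2).sum() / (w.sum() * T))
            pis.append(w.mean()); betas.append(beta)
        # E-step: Gaussian log-density of each curve under each cluster
        logp = np.stack([np.log(pis[j])
                         - 0.5 * T * np.log(2 * np.pi * sig2[j])
                         - 0.5 * ((Y - B @ betas[j])**2).sum(axis=1) / sig2[j]
                         for j in range(K)], axis=1)
        logp -= logp.max(axis=1, keepdims=True)       # stabilise before exp
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)
    return resp.argmax(axis=1), np.column_stack(betas)
```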
{"title":"Clustering functional data via variational inference","authors":"Chengqian Xian, Camila P. E. de Souza, John Jewell, Ronaldo Dias","doi":"10.1007/s11634-024-00590-w","DOIUrl":"https://doi.org/10.1007/s11634-024-00590-w","url":null,"abstract":"<p>Among different functional data analyses, clustering analysis aims to determine underlying groups of curves in the dataset when there is no information on the group membership of each curve. In this work, we develop a novel variational Bayes (VB) algorithm for clustering and smoothing functional data simultaneously via a B-spline regression mixture model with random intercepts. We employ the deviance information criterion to select the best number of clusters. The proposed VB algorithm is evaluated and compared with other methods (<i>k</i>-means, functional <i>k</i>-means and two other model-based methods) via a simulation study under various scenarios. We apply our proposed methodology to two publicly available datasets. We demonstrate that the proposed VB algorithm achieves satisfactory clustering performance in both simulation and real data analyses.\u0000</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"50 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140831385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Liszt’s Étude S.136 no.1: audio data analysis of two different piano recordings
Pub Date: 2024-04-26 | DOI: 10.1007/s11634-024-00594-6
Matteo Farnè
In this paper, we review the main signal processing tools of Music Information Retrieval (MIR) from audio data and apply them to two recordings (by Leslie Howard and Thomas Rajna) of Franz Liszt’s Étude S.136 no.1, with the aim of uncovering the macro-formal structure and comparing the interpretative styles of the two performers. In particular, after a thorough spectrogram analysis, we perform a segmentation based on the degree of novelty, in the sense of spectral dissimilarity, calculated frame by frame via the cosine distance. We then compare the metrical, temporal and timbral features of the two performances with MIR tools. Via this method, we are able to identify in a data-driven way the different moments of the piece according to their melodic and harmonic content, and to find that Rajna’s performance is faster and less varied, in terms of intensity and timbre, than Howard’s. This enquiry is a case study showing the potential of MIR from audio data to support traditional music score analyses and to provide objective information for statistically grounded analyses of musical performances.
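A minimal sketch of this frame-by-frame novelty computation (SciPy only; the file name, window sizes and peak threshold are placeholder choices, not those used in the paper):

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

fs, x = wavfile.read("etude_s136_no1.wav")       # hypothetical recording
if x.ndim > 1:
    x = x.mean(axis=1)                           # mixdown to mono
f, t, S = spectrogram(x, fs=fs, nperseg=4096, noverlap=2048)

# cosine distance between consecutive magnitude frames
Sn = S / (np.linalg.norm(S, axis=0, keepdims=True) + 1e-12)
novelty = 1.0 - (Sn[:, 1:] * Sn[:, :-1]).sum(axis=0)

# candidate segment boundaries: local maxima above a threshold
thr = novelty.mean() + 2 * novelty.std()
peaks = [i for i in range(1, len(novelty) - 1)
         if novelty[i] > thr
         and novelty[i] >= novelty[i - 1]
         and novelty[i] >= novelty[i + 1]]
boundaries_sec = [t[i + 1] for i in peaks]       # boundary times in seconds
```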
{"title":"Liszt’s Étude S.136 no.1: audio data analysis of two different piano recordings","authors":"Matteo Farnè","doi":"10.1007/s11634-024-00594-6","DOIUrl":"10.1007/s11634-024-00594-6","url":null,"abstract":"<div><p>In this paper, we review the main signal processing tools of Music Information Retrieval (MIR) from audio data, and we apply them to two recordings (by Leslie Howard and Thomas Rajna) of Franz Liszt’s Étude S.136 no.1, with the aim of uncovering the macro-formal structure and comparing the interpretative styles of the two performers. In particular, after a thorough spectrogram analysis, we perform a segmentation based on the degree of novelty, in the sense of spectral dissimilarity, calculated frame-by-frame via the cosine distance. We then compare the metrical, temporal and timbrical features of the two executions by MIR tools. Via this method, we are able to identify in a data-driven way the different moments of the piece according to their melodic and harmonic content, and to find out that Rajna’s execution is faster and less various, in terms of intensity and timbre, than Howard’s one. This enquiry represents a case study able to show the potentialities of MIR from audio data in supporting traditional music score analyses and in providing objective information for statistically founded musical execution analyses.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"18 3","pages":"797 - 822"},"PeriodicalIF":1.4,"publicationDate":"2024-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s11634-024-00594-6.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140801727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Comparison of internal evaluation criteria in hierarchical clustering of categorical data
Pub Date: 2024-04-13 | DOI: 10.1007/s11634-024-00592-8
Zdenek Sulc, Jaroslav Hornicek, Hana Rezankova, Jana Cibulkova
The paper discusses eleven internal evaluation criteria that can be used in hierarchical clustering of categorical data. The criteria are divided into two distinct groups based on how they treat cluster quality: variability-based and distance-based. The paper pursues three main aims. The first is to compare the examined criteria regarding their mutual similarity and their dependence on the clustered datasets’ properties and the similarity measures used. The second is to analyze the relationships between internal and external cluster evaluation, to determine how well the internal criteria can recognize the original number of clusters in datasets, and to what extent they provide results comparable to the external criteria. The third is to propose two new variability-based internal evaluation criteria. In the experiment, 81 types of generated datasets with controlled properties are used. The results show which internal criteria can be recommended for specific tasks, such as judging cluster quality or determining the optimal number of clusters.
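As one concrete example of what a variability-based criterion can look like, here is a generic within-cluster entropy measure (illustrative only; the paper’s eleven criteria and the two new proposals are defined in the text):

```python
import numpy as np

def within_cluster_entropy(X, labels):
    """Size-weighted average within-cluster entropy over categorical variables.

    X: (n, p) array of category codes. Lower values indicate more
    homogeneous (lower-variability) clusters.
    """
    n, p = X.shape
    total = 0.0
    for g in np.unique(labels):
        Xg = X[labels == g]
        h = 0.0
        for j in range(p):
            _, counts = np.unique(Xg[:, j], return_counts=True)
            freq = counts / counts.sum()
            h += -(freq * np.log(freq)).sum()   # entropy of variable j in cluster g
        total += (len(Xg) / n) * (h / p)        # weight by cluster size
    return total
```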
{"title":"Comparison of internal evaluation criteria in hierarchical clustering of categorical data","authors":"Zdenek Sulc, Jaroslav Hornicek, Hana Rezankova, Jana Cibulkova","doi":"10.1007/s11634-024-00592-8","DOIUrl":"https://doi.org/10.1007/s11634-024-00592-8","url":null,"abstract":"<p>The paper discusses eleven internal evaluation criteria that can be used in the area of hierarchical clustering of categorical data. The criteria are divided into two distinct groups based on how they treat the cluster quality: variability- and distance-based. The paper follows three main aims. The first one is to compare the examined criteria regarding their mutual similarity and dependence on the clustered datasets’ properties and the used similarity measures. The second one is to analyze the relationships between internal and external cluster evaluation to determine how well the internal criteria can recognize the original number of clusters in datasets and to what extent they provide comparable results to the external criteria. The third aim is to propose two new variability-based internal evaluation criteria. In the experiment, 81 types of generated datasets with controlled properties are used. The results show which internal criteria can be recommended for specific tasks, such as judging the cluster quality or the optimal number of clusters determination.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"49 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140589022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multidimensional scaling for big data
Pub Date: 2024-04-13 | DOI: 10.1007/s11634-024-00591-9
Pedro Delicado, Cristian Pachón-García
We present a set of algorithms implementing multidimensional scaling (MDS) for large data sets. MDS is a family of dimensionality reduction techniques using an n × n distance matrix as input, where n is the number of individuals, and producing a low-dimensional configuration: an n × r matrix with r ≪ n. When n is large, MDS is unaffordable with classical MDS algorithms because of their extremely large memory and time requirements. We compare six non-standard algorithms intended to overcome these difficulties. They are based on the central idea of partitioning the data set into small pieces on which classical MDS methods can work. Two of these algorithms are original proposals. In order to check the performance of the algorithms as well as to compare them, we have done a simulation study. Additionally, we have used the algorithms to obtain an MDS configuration for EMNIST, a real large data set with more than 800,000 points. We conclude that all the algorithms are appropriate for obtaining an MDS configuration, but we recommend one of our proposals, since it is a fast algorithm with satisfactory statistical properties when working with big data. An R package implementing the algorithms has been created.
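A minimal sketch of the partitioning idea behind such algorithms (a generic divide-and-conquer scheme with shared anchor points and Procrustes alignment; the six algorithms compared in the paper, including the recommended one, may differ in detail):

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import orthogonal_procrustes

def classical_mds(D, r):
    # Torgerson: double-center the squared distances, keep top-r eigenpairs
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    Bmat = -0.5 * J @ (D**2) @ J
    vals, vecs = np.linalg.eigh(Bmat)
    order = np.argsort(vals)[::-1][:r]
    return vecs[:, order] * np.sqrt(np.clip(vals[order], 0.0, None))

def mds_by_chunks(X, r, chunk=1000, n_anchors=100, seed=0):
    rng = np.random.default_rng(seed)
    anchors = X[rng.choice(len(X), size=n_anchors, replace=False)]
    out = np.empty((len(X), r))
    ref = None
    for s in range(0, len(X), chunk):
        # embed the shared anchors together with each chunk
        block = np.vstack([anchors, X[s:s + chunk]])
        Y = classical_mds(cdist(block, block), r)
        A, Z = Y[:n_anchors], Y[n_anchors:]
        if ref is None:
            ref = A                        # first chunk fixes the global frame
        # rotate this chunk so its anchor image matches the reference anchors
        R, _ = orthogonal_procrustes(A - A.mean(0), ref - ref.mean(0))
        out[s:s + chunk] = (Z - A.mean(0)) @ R + ref.mean(0)
    return out
```

Each chunk only ever requires an (n_anchors + chunk)-sized distance matrix, which is what keeps memory bounded as n grows.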
{"title":"Multidimensional scaling for big data","authors":"Pedro Delicado, Cristian Pachón-García","doi":"10.1007/s11634-024-00591-9","DOIUrl":"https://doi.org/10.1007/s11634-024-00591-9","url":null,"abstract":"<p>We present a set of algorithms implementing multidimensional scaling (MDS) for large data sets. MDS is a family of dimensionality reduction techniques using a <span>(n times n)</span> distance matrix as input, where <i>n</i> is the number of individuals, and producing a low dimensional configuration: a <span>(ntimes r)</span> matrix with <span>(r<<n)</span>. When <i>n</i> is large, MDS is unaffordable with classical MDS algorithms because their extremely large memory and time requirements. We compare six non-standard algorithms intended to overcome these difficulties. They are based on the central idea of partitioning the data set into small pieces, where classical MDS methods can work. Two of these algorithms are original proposals. In order to check the performance of the algorithms as well as to compare them, we have done a simulation study. Additionally, we have used the algorithms to obtain an MDS configuration for EMNIST: a real large data set with more than 800000 points. We conclude that all the algorithms are appropriate to use for obtaining an MDS configuration, but we recommend to use one of our proposals, since it is a fast algorithm with satisfactory statistical properties when working with big data. An <span>R</span> package implementing the algorithms has been created.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"26 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140588671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
View selection in multi-view stacking: choosing the meta-learner
Pub Date: 2024-04-12 | DOI: 10.1007/s11634-024-00587-5
Wouter van Loon, Marjolein Fokkema, Botond Szabo, Mark de Rooij
Multi-view stacking is a framework for combining information from different views (i.e. different feature sets) describing the same set of objects. In this framework, a base-learner algorithm is trained on each view separately, and their predictions are then combined by a meta-learner algorithm. In a previous study, stacked penalized logistic regression, a special case of multi-view stacking, was shown to be useful in identifying which views are most important for prediction. In this article we expand this research by considering seven different algorithms to use as the meta-learner, and evaluating their view selection and classification performance in simulations and two applications on real gene-expression data sets. Our results suggest that if both view selection and classification accuracy are important to the research at hand, then the nonnegative lasso, nonnegative adaptive lasso and nonnegative elastic net are suitable meta-learners. Exactly which among these three is to be preferred depends on the research context. The remaining four meta-learners, namely nonnegative ridge regression, nonnegative forward selection, stability selection and the interpolating predictor, show little advantage over the other three.
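A minimal stacking sketch with the nonnegative lasso as the meta-learner (scikit-learn; the choice of base learner, fold count and penalty strength are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Lasso
from sklearn.model_selection import cross_val_predict

def multi_view_stack(views, y, alpha=0.01):
    # views: list of (n, p_v) feature matrices describing the same n objects.
    # Base learners: one logistic regression per view, with out-of-fold
    # predictions so the meta-learner never sees resubstitution estimates.
    Z = np.column_stack([
        cross_val_predict(LogisticRegression(max_iter=1000), V, y,
                          cv=5, method="predict_proba")[:, 1]
        for V in views
    ])
    # Nonnegative lasso meta-learner: a coefficient shrunk to exactly
    # zero means the corresponding view is deselected.
    meta = Lasso(alpha=alpha, positive=True).fit(Z, y)
    return meta.coef_, meta
```

The nonnegativity constraint plus the sparsity of the lasso penalty is what ties the meta-learner’s coefficients to view selection: selected views get positive weights, discarded views get exact zeros.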
{"title":"View selection in multi-view stacking: choosing the meta-learner","authors":"Wouter van Loon, Marjolein Fokkema, Botond Szabo, Mark de Rooij","doi":"10.1007/s11634-024-00587-5","DOIUrl":"https://doi.org/10.1007/s11634-024-00587-5","url":null,"abstract":"<p>Multi-view stacking is a framework for combining information from different views (i.e. different feature sets) describing the same set of objects. In this framework, a <i>base-learner</i> algorithm is trained on each view separately, and their predictions are then combined by a <i>meta-learner</i> algorithm. In a previous study, stacked penalized logistic regression, a special case of multi-view stacking, has been shown to be useful in identifying which views are most important for prediction. In this article we expand this research by considering seven different algorithms to use as the meta-learner, and evaluating their view selection and classification performance in simulations and two applications on real gene-expression data sets. Our results suggest that if both view selection and classification accuracy are important to the research at hand, then the nonnegative lasso, nonnegative adaptive lasso and nonnegative elastic net are suitable meta-learners. Exactly which among these three is to be preferred depends on the research context. The remaining four meta-learners, namely nonnegative ridge regression, nonnegative forward selection, stability selection and the interpolating predictor, show little advantages in order to be preferred over the other three.\u0000</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"438 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140602607","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Natural-neighborhood based, label-specific undersampling for imbalanced, multi-label data
Pub Date: 2024-03-30 | DOI: 10.1007/s11634-024-00589-3
Payel Sadhukhan, Sarbani Palit
This work presents a novel undersampling scheme, NaNUML, to tackle the imbalance problem in multi-label datasets. We use the principles of the natural nearest neighborhood and follow a paradigm of label-specific undersampling. The natural nearest neighborhood is a parameter-free principle, and our scheme’s novelty lies in exploiting it: the key factor, the neighborhood size k, is determined without invoking any parameter optimization. The class imbalance problem is particularly challenging in a multi-label context, as the imbalance ratio and the majority–minority distributions vary from label to label; consequently, the majority–minority class overlaps also vary across the labels. Working on this aspect, we propose a framework in which a single natural neighbor search is sufficient to identify all the label-specific overlaps. Natural neighbor information is also used to find the key lattices of the majority class (which we do not undersample). An empirical study involving twelve real-world multi-label datasets, seven competing methods, and four evaluation metrics shows that NaNUML mitigates the class-imbalance issue in multi-label datasets to a considerable extent and achieves statistically superior performance over the competing methods in several settings.
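One common formulation of the parameter-free natural-neighbor search grows the neighborhood size k until the set of points that appear in no other point’s k-NN list stops shrinking; a sketch follows (NaNUML’s exact search may differ):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def natural_neighbor_search(X, max_k=50):
    # Grow k until every point occurs in someone's k-NN list, or until
    # the number of such "orphan" points stops shrinking.
    n = len(X)
    nn = NearestNeighbors().fit(X)
    prev_orphans = n + 1
    for k in range(1, max_k + 1):
        # each point's k nearest neighbors, excluding the point itself
        idx = nn.kneighbors(X, n_neighbors=k + 1, return_distance=False)[:, 1:]
        has_reverse = np.zeros(n, dtype=bool)
        has_reverse[np.unique(idx)] = True   # appears in some k-NN list
        orphans = int((~has_reverse).sum())
        if orphans == 0 or orphans == prev_orphans:
            return k, idx                    # natural k and the k-NN lists
        prev_orphans = orphans
    return max_k, idx
```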
{"title":"Natural-neighborhood based, label-specific undersampling for imbalanced, multi-label data","authors":"Payel Sadhukhan, Sarbani Palit","doi":"10.1007/s11634-024-00589-3","DOIUrl":"10.1007/s11634-024-00589-3","url":null,"abstract":"<div><p>This work presents a novel undersampling scheme to tackle the imbalance problem in multi-label datasets. We use the principles of the natural nearest neighborhood and follow a paradigm of label-specific undersampling. Natural-nearest neighborhood is a parameter-free principle. Our scheme’s novelty lies in exploring the parameter-optimization-free natural nearest neighborhood principles. The class imbalance problem is particularly challenging in a multi-label context, as the imbalance ratio and the majority–minority distributions vary from label to label. Consequently, the majority–minority class overlaps also vary across the labels. Working on this aspect, we propose a framework where a single natural neighbor search is sufficient to identify all the label-specific overlaps. Natural neighbor information is also used to find the key lattices of the majority class (which we do not undersample). The performance of the proposed method, NaNUML, indicates its ability to mitigate the class-imbalance issue in multi-label datasets to a considerable extent. We could also establish a statistically superior performance over other competing methods several times. An empirical study involving twelve real-world multi-label datasets, seven competing methods, and four evaluating metrics—shows that the proposed method effectively handles the class-imbalance issue in multi-label datasets. In this work, we have presented a novel label-specific undersampling scheme, NaNUML, for multi-label datasets. NaNUML is based on the parameter-free natural neighbor search and the key factor, neighborhood size ‘k’ is determined without invoking any parameter optimization.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"18 3","pages":"723 - 744"},"PeriodicalIF":1.4,"publicationDate":"2024-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140363801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Clustering ensemble extraction: a knowledge reuse framework
Pub Date: 2024-03-27 | DOI: 10.1007/s11634-024-00588-4
Mohaddeseh Sedghi, Ebrahim Akbari, Homayun Motameni, Touraj Banirostam
A clustering ensemble combines several base clusterings with a consensus function to produce the final clustering without access to the data features. The quality and diversity of a vast library of base clusterings influence the performance of the consensus function. When a large library of diverse clusterings is not available, this function produces results of lower quality than the base clusterings themselves. Expanding the diversity of clusters in the collection to increase the performance of consensus, especially when there is no access to specific data features or assumptions on the data distribution, remains an open problem. The approach proposed in this paper, Clustering Ensemble Extraction, considers the similarity criterion at the cluster level and places the most similar clusters in the same group. It then extracts new clusters with the help of the Extracting Clusters Algorithm. Finally, two new consensus functions, namely the Cluster-based extracted partitioning algorithm and the Meta-cluster extracted algorithm, are defined and applied to the new clusters in order to create a high-quality clustering. The results of the empirical experiments conducted in this study showed that the new consensus functions obtained by our proposed method outperformed methods previously proposed in the literature regarding clustering quality and efficiency.
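A generic sketch of this cluster-level grouping (Jaccard similarity between base clusters, hierarchical grouping, and a simple membership vote; illustrative only, not the paper’s exact consensus functions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def ensemble_extract(partitions, k):
    # partitions: list of label vectors over the same n objects
    # 1. represent every base cluster as a binary membership vector
    M = np.array([(p == c).astype(float)
                  for p in map(np.asarray, partitions)
                  for c in np.unique(p)])              # (n_base_clusters, n)
    # 2. Jaccard similarity between base clusters
    inter = M @ M.T
    sizes = M.sum(axis=1)
    jac = inter / (sizes[:, None] + sizes[None, :] - inter)
    # 3. group the most similar base clusters into (at most) k meta-clusters
    Z = linkage(1.0 - jac[np.triu_indices(len(M), 1)], method="average")
    groups = fcluster(Z, t=k, criterion="maxclust")
    # 4. each object joins the meta-cluster that votes for it most strongly
    votes = np.vstack([M[groups == g].mean(axis=0) for g in np.unique(groups)])
    return votes.argmax(axis=0)
```

Note that only label vectors enter the computation; no data features are touched, which is the defining constraint of the clustering-ensemble setting.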
{"title":"Clustering ensemble extraction: a knowledge reuse framework","authors":"Mohaddeseh Sedghi, Ebrahim Akbari, Homayun Motameni, Touraj Banirostam","doi":"10.1007/s11634-024-00588-4","DOIUrl":"https://doi.org/10.1007/s11634-024-00588-4","url":null,"abstract":"<p>Clustering ensemble combines several fundamental clusterings with a consensus function to produce the final clustering without gaining access to data features. The quality and diversity of a vast library of base clusterings influence the performance of the consensus function. When a huge library of various clusterings is not available, this function produces results of lower quality than those of the basic clustering. The expansion of diverse clusters in the collection to increase the performance of consensus, especially in cases where there is no access to specific data features or assumptions in the data distribution, has still remained an open problem. The approach proposed in this paper, Clustering Ensemble Extraction, considers the similarity criterion at the cluster level and places the most similar clusters in the same group. Then, it extracts new clusters with the help of the Extracting Clusters Algorithm. Finally, two new consensus functions, namely Cluster-based extracted partitioning algorithm and Meta-cluster extracted algorithm, are defined and then applied to new clusters in order to create a high-quality clustering. The results of the empirical experiments conducted in this study showed that the new consensus function obtained by our proposed method outperformed the methods previously proposed in the literature regarding the clustering quality and efficiency.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"68 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140312089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mixtures of regressions using matrix-variate heavy-tailed distributions
Pub Date: 2024-03-16 | DOI: 10.1007/s11634-024-00585-7
Finite mixtures of regressions (FMRs) are powerful clustering devices used in many regression-type analyses. Unfortunately, real data often present atypical observations that make the commonly adopted normality assumption for the mixture components inadequate. Thus, to robustify the FMR approach in a matrix-variate framework, we introduce ten FMRs based on the matrix-variate t and contaminated normal distributions. Furthermore, once one of our models is estimated and the observations are assigned to the groups, different procedures can be used for the detection of atypical points in the data. An ECM algorithm is outlined for maximum likelihood parameter estimation. Using simulated data, we show the negative consequences (in terms of parameter estimates and inferred classification) of wrongly assuming normality in the presence of heavy-tailed clusters or noisy matrices; our models, in contrast, address such issues properly. Additionally, on the same data, the atypical-point detection procedures are investigated. A real-data analysis concerning the relationship between greenhouse gas emissions and their determinants is conducted, and the behavior of our models in the presence of heterogeneity and atypical observations is discussed.
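For reference, the matrix-variate normal density that these heavy-tailed components generalize, together with the generic mixture-of-regressions form (notation assumed here, not copied from the paper):

```latex
% Matrix-variate normal density for an n x p response Y with mean M,
% row covariance U (n x n) and column covariance V (p x p):
\[
  \phi_{n \times p}(Y; M, U, V)
  = \frac{\exp\!\left\{ -\tfrac{1}{2}\,
      \operatorname{tr}\!\left[ U^{-1} (Y - M)\, V^{-1} (Y - M)^{\top} \right] \right\}}
         {(2\pi)^{np/2}\, |U|^{p/2}\, |V|^{n/2}} .
\]
% A G-component matrix-variate FMR then models, given covariates X,
\[
  f(Y \mid X) \;=\; \sum_{g=1}^{G} \pi_g\, f_g\!\left(Y; M_g(X), U_g, V_g\right),
  \qquad \pi_g > 0, \quad \sum_{g=1}^{G} \pi_g = 1 ,
\]
% where each f_g is a heavy-tailed component (matrix-variate t or
% contaminated normal in the paper) in place of the normal density above.
```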
{"title":"Mixtures of regressions using matrix-variate heavy-tailed distributions","authors":"","doi":"10.1007/s11634-024-00585-7","DOIUrl":"https://doi.org/10.1007/s11634-024-00585-7","url":null,"abstract":"<h3>Abstract</h3> <p>Finite mixtures of regressions (FMRs) are powerful clustering devices used in many regression-type analyses. Unfortunately, real data often present atypical observations that make the commonly adopted normality assumption of the mixture components inadequate. Thus, to robustify the FMR approach in a matrix-variate framework, we introduce ten FMRs based on the matrix-variate <em>t</em> and contaminated normal distributions. Furthermore, once one of our models is estimated and the observations are assigned to the groups, different procedures can be used for the detection of the atypical points in the data. An ECM algorithm is outlined for maximum likelihood parameter estimation. By using simulated data, we show the negative consequences (in terms of parameter estimates and inferred classification) of the wrong normality assumption in the presence of heavy-tailed clusters or noisy matrices. Such issues are properly addressed by our models instead. Additionally, over the same data, the atypical points detection procedures are also investigated. A real-data analysis concerning the relationship between greenhouse gas emissions and their determinants is conducted, and the behavior of our models in the presence of heterogeneity and atypical observations is discussed.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"1 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140147449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}