Advances in Data Analysis and Classification: latest articles, page 4

Clustering large mixed-type data with ordinal variables

IF 1.3, Zone 4 Computer Science, Q2 STATISTICS & PROBABILITY

Advances in Data Analysis and Classification

Pub Date : 2024-05-27 DOI: 10.1007/s11634-024-00595-5

Gero Szepannek, Rabea Aschenbruck, Adalbert Wilhelm

One of the most frequently used algorithms for clustering data with both numeric and categorical variables is the k-prototypes algorithm, an extension of the well-known k-means clustering. Gower’s distance is another popular approach for dealing with mixed-type data and is suitable not only for numeric and categorical but also for ordinal variables. This paper proposes a modification of the k-prototypes algorithm that uses Gower’s distance and ensures convergence. The result is a tool that takes ordinal information into account for clustering and can also be used for large data. A simulation study demonstrates convergence, good clustering results, and small runtimes.
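
To make the role of ordinal variables concrete, here is a minimal sketch of the textbook Gower dissimilarity between two records. It is not the authors' modified k-prototypes algorithm. The function name `gower_distance`, the `kinds` and `ranges` arguments, and the toy records are assumptions made for illustration only.

```python
def gower_distance(a, b, kinds, ranges):
    """Gower dissimilarity between two records a and b.

    kinds[j] is "num", "cat", or "ord" (hypothetical labels).
    ranges[j] is the observed range of variable j: max - min for numeric
    values, number of levels - 1 for ordinal ranks. It is ignored for
    categorical variables.
    """
    total = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "cat":
            # Categorical: simple mismatch, 0 if equal and 1 otherwise.
            total += 0.0 if x == y else 1.0
        else:
            # Numeric and ordinal (coded as ranks): range-scaled absolute difference.
            total += abs(x - y) / rng if rng else 0.0
    # Average the per-variable contributions.
    return total / len(a)


kinds = ["num", "cat", "ord"]
ranges = [40.0, None, 2]      # numeric variable spans 40 units; ordinal has 3 levels
r1 = (30.0, "red", 0)         # ordinal level coded as rank 0, 1, or 2
r2 = (50.0, "blue", 2)
d = gower_distance(r1, r2, kinds, ranges)
```

Here the three per-variable contributions are 20/40, 1, and 2/2, so `d` is their mean, 2.5/3. Coding ordinal levels as ranks is one common convention; other scalings exist and the paper may use a different one.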

Citations: 0

A two-group canonical variate analysis biplot for an optimal display of both means and cases

IF 1.3, Zone 4 Computer Science, Q2 STATISTICS & PROBABILITY

Advances in Data Analysis and Classification

Pub Date : 2024-05-06 DOI: 10.1007/s11634-024-00593-7

Niel le Roux, Sugnet Gardner-Lubbe

Canonical variate analysis (CVA) entails a two-sided eigenvalue decomposition. When the number of groups, J, is less than the number of variables, p, at most (J-1) eigenvalues are not exactly zero. A CVA biplot is the simultaneous display of the two entities: group means as points and variables as calibrated biplot axes. It follows that with two groups the group means can be exactly represented in a one-dimensional biplot but the individual samples are approximated. We define a criterion to measure the quality of representing the individual samples in a CVA biplot. Then, for the two-group case we propose an additional dimension for constructing an optimal two-dimensional CVA biplot. The proposed novel CVA biplot maintains the exact display of group means and biplot axes, but the individual sample points satisfy the optimality criterion in a unique simultaneous display of group means, calibrated biplot axes for the variables, and within group samples. Although our primary aim is to address two-group CVA, our proposal extends immediately to an optimal three-dimensional biplot when encountering the equally important case of comparing three groups in practice.

Citations: 0

Clustering functional data via variational inference

IF 1.3, Zone 4 Computer Science, Q2 STATISTICS & PROBABILITY

Advances in Data Analysis and Classification

Pub Date : 2024-04-30 DOI: 10.1007/s11634-024-00590-w

Chengqian Xian, Camila P. E. de Souza, John Jewell, Ronaldo Dias

Among different functional data analyses, clustering analysis aims to determine underlying groups of curves in the dataset when there is no information on the group membership of each curve. In this work, we develop a novel variational Bayes (VB) algorithm for clustering and smoothing functional data simultaneously via a B-spline regression mixture model with random intercepts. We employ the deviance information criterion to select the best number of clusters. The proposed VB algorithm is evaluated and compared with other methods (k-means, functional k-means and two other model-based methods) via a simulation study under various scenarios. We apply our proposed methodology to two publicly available datasets. We demonstrate that the proposed VB algorithm achieves satisfactory clustering performance in both simulation and real data analyses.

Citations: 0

Liszt’s Étude S.136 no.1: audio data analysis of two different piano recordings

IF 1.4, Zone 4 Computer Science, Q2 STATISTICS & PROBABILITY

Advances in Data Analysis and Classification

Pub Date : 2024-04-26 DOI: 10.1007/s11634-024-00594-6

Matteo Farnè

In this paper, we review the main signal processing tools of Music Information Retrieval (MIR) from audio data, and we apply them to two recordings (by Leslie Howard and Thomas Rajna) of Franz Liszt’s Étude S.136 no.1, with the aim of uncovering the macro-formal structure and comparing the interpretative styles of the two performers. In particular, after a thorough spectrogram analysis, we perform a segmentation based on the degree of novelty, in the sense of spectral dissimilarity, calculated frame-by-frame via the cosine distance. We then compare the metrical, temporal and timbral features of the two performances using MIR tools. Via this method, we are able to identify in a data-driven way the different moments of the piece according to their melodic and harmonic content, and to find that Rajna’s performance is faster and less varied, in terms of intensity and timbre, than Howard’s. This enquiry is a case study showing the potential of MIR from audio data in supporting traditional music score analyses and in providing objective information for statistically founded analyses of musical performance.
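
The frame-by-frame novelty idea can be sketched as follows. This is a simplified illustration, assuming that novelty is the cosine distance between consecutive spectral frames; the paper's actual novelty measure may be more elaborate. The names `cosine_distance` and `novelty_curve` and the toy frames are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two magnitude-spectrum frames."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Treat silent frames as carrying no novelty (an assumption).
        return 0.0
    return 1.0 - dot / (norm_u * norm_v)


def novelty_curve(frames):
    """Cosine distance between each frame and the one before it."""
    return [cosine_distance(frames[i - 1], frames[i])
            for i in range(1, len(frames))]


# Toy "spectrogram": energy sits in bin 0 for two frames, then moves to bin 1.
frames = [
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 3.0, 0.0],
    [0.0, 1.0, 0.0],
]
novelty = novelty_curve(frames)
# Index of the frame that starts the most novel segment.
boundary = max(range(len(novelty)), key=novelty.__getitem__) + 1
```

Because cosine distance ignores overall loudness, the change from frame 0 to frame 1 (same spectral shape, different level) contributes no novelty, while the shift of energy to another bin at frame 2 yields the peak that marks a segment boundary.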

Citations: 0

Comparison of internal evaluation criteria in hierarchical clustering of categorical data

IF 1.3, Zone 4 Computer Science, Q2 STATISTICS & PROBABILITY

Advances in Data Analysis and Classification

Pub Date : 2024-04-13 DOI: 10.1007/s11634-024-00592-8

Zdenek Sulc, Jaroslav Hornicek, Hana Rezankova, Jana Cibulkova

The paper discusses eleven internal evaluation criteria that can be used in hierarchical clustering of categorical data. The criteria are divided into two distinct groups based on how they treat cluster quality: variability-based and distance-based. The paper has three main aims. The first is to compare the examined criteria regarding their mutual similarity and their dependence on the properties of the clustered datasets and on the similarity measures used. The second is to analyze the relationships between internal and external cluster evaluation to determine how well the internal criteria can recognize the original number of clusters in datasets and to what extent they provide results comparable to the external criteria. The third aim is to propose two new variability-based internal evaluation criteria. In the experiment, 81 types of generated datasets with controlled properties are used. The results show which internal criteria can be recommended for specific tasks, such as judging cluster quality or determining the optimal number of clusters.

Citations: 0

Multidimensional scaling for big data

IF 1.3, Zone 4 Computer Science, Q2 STATISTICS & PROBABILITY

Advances in Data Analysis and Classification

Pub Date : 2024-04-13 DOI: 10.1007/s11634-024-00591-9

Pedro Delicado, Cristian Pachón-García

We present a set of algorithms implementing multidimensional scaling (MDS) for large data sets. MDS is a family of dimensionality reduction techniques using an $n \times n$ distance matrix as input, where $n$ is the number of individuals, and producing a low-dimensional configuration: an $n \times r$ matrix with $r \ll n$. When $n$ is large, MDS is unaffordable with classical MDS algorithms because of their extremely large memory and time requirements. We compare six non-standard algorithms intended to overcome these difficulties. They are based on the central idea of partitioning the data set into small pieces, where classical MDS methods can work. Two of these algorithms are original proposals. To check the performance of the algorithms and to compare them, we carried out a simulation study. Additionally, we used the algorithms to obtain an MDS configuration for EMNIST, a real large data set with more than 800,000 points. We conclude that all the algorithms are appropriate for obtaining an MDS configuration, but we recommend using one of our proposals, since it is a fast algorithm with satisfactory statistical properties when working with big data. An R package implementing the algorithms has been created.
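
The memory argument behind partitioning can be checked with back-of-the-envelope arithmetic. The sketch below assumes a dense square distance matrix stored as 8-byte floats, takes 800,000 as a lower bound on the EMNIST size quoted above, and uses a hypothetical piece size of 1,000 points. None of these choices, nor the helper name `distance_matrix_bytes`, come from the paper, and the sketch does not implement any of the six algorithms.

```python
def distance_matrix_bytes(n, bytes_per_entry=8):
    """Memory needed for a dense n x n distance matrix."""
    return n * n * bytes_per_entry


n = 800_000        # lower bound on the number of points (assumption for illustration)
piece = 1_000      # hypothetical size of one partition

full = distance_matrix_bytes(n)          # whole data set at once
block = distance_matrix_bytes(piece)     # one partition at a time
num_pieces = -(-n // piece)              # ceiling division

print(f"full matrix: {full / 1e12:.2f} TB")
print(f"one block:   {block / 1e6:.0f} MB, repeated for {num_pieces} pieces")
```

Under these assumptions the full matrix needs about 5.12 TB, while each block needs 8 MB, which is why classical MDS becomes feasible on the pieces even though it is not on the whole data set. How the per-piece configurations are stitched into one common configuration is the substance of the algorithms compared in the paper and is not shown here.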

Citations: 0

View selection in multi-view stacking: choosing the meta-learner

IF 1.3, Zone 4 Computer Science, Q2 STATISTICS & PROBABILITY

Advances in Data Analysis and Classification

Pub Date : 2024-04-12 DOI: 10.1007/s11634-024-00587-5

Wouter van Loon, Marjolein Fokkema, Botond Szabo, Mark de Rooij

Multi-view stacking is a framework for combining information from different views (i.e. different feature sets) describing the same set of objects. In this framework, a base-learner algorithm is trained on each view separately, and their predictions are then combined by a meta-learner algorithm. In a previous study, stacked penalized logistic regression, a special case of multi-view stacking, was shown to be useful in identifying which views are most important for prediction. In this article we expand this research by considering seven different algorithms to use as the meta-learner, and evaluating their view selection and classification performance in simulations and two applications on real gene-expression data sets. Our results suggest that if both view selection and classification accuracy are important to the research at hand, then the nonnegative lasso, nonnegative adaptive lasso and nonnegative elastic net are suitable meta-learners. Exactly which among these three is to be preferred depends on the research context. The remaining four meta-learners, namely nonnegative ridge regression, nonnegative forward selection, stability selection and the interpolating predictor, show few advantages that would make them preferable to the other three.
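
How a nonnegativity constraint on the meta-learner can perform view selection is illustrated by the sketch below. It is a deliberately simplified stand-in: plain least squares with nonnegative weights, fitted by projected gradient descent, rather than any of the seven meta-learners studied in the article. The function name `nonnegative_least_squares`, the fixed step size, and the toy base-learner predictions are all assumptions.

```python
def nonnegative_least_squares(Z, y, step=0.1, iters=2000):
    """Projected gradient descent for min ||y - Z w||^2 subject to w >= 0.

    The fixed step size is chosen for this toy data, not tuned in general.
    """
    n_views = len(Z[0])
    w = [0.0] * n_views
    for _ in range(iters):
        residuals = [yi - sum(wj * zij for wj, zij in zip(w, zi))
                     for zi, yi in zip(Z, y)]
        grad = [-2.0 * sum(r * zi[j] for r, zi in zip(residuals, Z))
                for j in range(n_views)]
        # Gradient step followed by projection onto the nonnegative orthant.
        w = [max(0.0, wj - step * gj) for wj, gj in zip(w, grad)]
    return w


# Each row holds the base-learner predictions for one object, one column per view.
# View 0 tracks the labels; view 1 points the wrong way.
Z = [[0.8, 0.2], [0.2, 0.8], [0.8, 0.2], [0.2, 0.8]]
y = [1.0, 0.0, 1.0, 0.0]

weights = nonnegative_least_squares(Z, y)
selected_views = [j for j, wj in enumerate(weights) if wj > 0.0]
```

On this toy data an unconstrained fit would give the second view a negative weight. The constraint pins that weight at zero instead, so only the first view is selected. The sparsity-inducing penalties of the lasso-type meta-learners named above go further than this sketch by also shrinking small positive weights to zero.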

Citations: 0

Natural-neighborhood based, label-specific undersampling for imbalanced, multi-label data

IF 1.4, Zone 4 Computer Science, Q2 STATISTICS & PROBABILITY

Advances in Data Analysis and Classification

Pub Date : 2024-03-30 DOI: 10.1007/s11634-024-00589-3

Payel Sadhukhan, Sarbani Palit

This work presents a novel undersampling scheme to tackle the imbalance problem in multi-label datasets. We use the principles of the natural nearest neighborhood and follow a paradigm of label-specific undersampling. Natural-nearest neighborhood is a parameter-free principle. Our scheme’s novelty lies in exploring the parameter-optimization-free natural nearest neighborhood principles. The class imbalance problem is particularly challenging in a multi-label context, as the imbalance ratio and the majority–minority distributions vary from label to label. Consequently, the majority–minority class overlaps also vary across the labels. Working on this aspect, we propose a framework where a single natural neighbor search is sufficient to identify all the label-specific overlaps. Natural neighbor information is also used to find the key lattices of the majority class (which we do not undersample). The performance of the proposed method, NaNUML, indicates its ability to mitigate the class-imbalance issue in multi-label datasets to a considerable extent. In several cases, it also achieves statistically superior performance over competing methods. An empirical study involving twelve real-world multi-label datasets, seven competing methods, and four evaluation metrics shows that the proposed method effectively handles the class-imbalance issue in multi-label datasets. In this work, we have presented a novel label-specific undersampling scheme, NaNUML, for multi-label datasets. NaNUML is based on the parameter-free natural neighbor search, and the key factor, the neighborhood size ‘k’, is determined without invoking any parameter optimization.

Citations: 0

Clustering ensemble extraction: a knowledge reuse framework

IF 1.3, Zone 4 Computer Science, Q2 STATISTICS & PROBABILITY

Advances in Data Analysis and Classification

Pub Date : 2024-03-27 DOI: 10.1007/s11634-024-00588-4

Mohaddeseh Sedghi, Ebrahim Akbari, Homayun Motameni, Touraj Banirostam

Clustering ensemble combines several fundamental clusterings with a consensus function to produce the final clustering without gaining access to data features. The quality and diversity of a vast library of base clusterings influence the performance of the consensus function. When a huge library of various clusterings is not available, this function produces results of lower quality than those of the basic clustering. Expanding the diverse clusters in the collection to increase the performance of consensus, especially in cases where there is no access to specific data features or assumptions about the data distribution, remains an open problem. The approach proposed in this paper, Clustering Ensemble Extraction, considers the similarity criterion at the cluster level and places the most similar clusters in the same group. Then, it extracts new clusters with the help of the Extracting Clusters Algorithm. Finally, two new consensus functions, namely the Cluster-based extracted partitioning algorithm and the Meta-cluster extracted algorithm, are defined and then applied to the new clusters in order to create a high-quality clustering. The results of the empirical experiments conducted in this study showed that the new consensus function obtained by our proposed method outperformed the methods previously proposed in the literature in terms of clustering quality and efficiency.

Citations: 0

Mixtures of regressions using matrix-variate heavy-tailed distributions

IF 1.6, Zone 4 Computer Science, Q2 STATISTICS & PROBABILITY

Advances in Data Analysis and Classification

Pub Date : 2024-03-16 DOI: 10.1007/s11634-024-00585-7

Finite mixtures of regressions (FMRs) are powerful clustering devices used in many regression-type analyses. Unfortunately, real data often present atypical observations that make the commonly adopted normality assumption of the mixture components inadequate. Thus, to robustify the FMR approach in a matrix-variate framework, we introduce ten FMRs based on the matrix-variate t and contaminated normal distributions. Furthermore, once one of our models is estimated and the observations are assigned to the groups, different procedures can be used for the detection of the atypical points in the data. An ECM algorithm is outlined for maximum likelihood parameter estimation. By using simulated data, we show the negative consequences (in terms of parameter estimates and inferred classification) of the wrong normality assumption in the presence of heavy-tailed clusters or noisy matrices. Such issues are properly addressed by our models instead. Additionally, over the same data, the atypical points detection procedures are also investigated. A real-data analysis concerning the relationship between greenhouse gas emissions and their determinants is conducted, and the behavior of our models in the presence of heterogeneity and atypical observations is discussed.

Citations: 0