
Advances in Data Analysis and Classification: Latest Articles

Flexible mixture regression with the generalized hyperbolic distribution
IF 1.4 | Zone 4, Computer Science | Q2 Statistics & Probability | Pub Date: 2023-01-04 | DOI: 10.1007/s11634-022-00532-4
Nam-Hwui Kim, Ryan P. Browne

When modeling the functional relationship between a response variable and covariates via linear regression, multiple relationships may be present depending on the underlying component structure. Deploying a flexible mixture distribution can help with capturing a wide variety of such structures, thereby successfully modeling the response–covariate relationship while addressing the components. In that spirit, a mixture regression model based on the finite mixture of generalized hyperbolic distributions is introduced, and its parameter estimation method is presented. The flexibility of the generalized hyperbolic distribution can identify better-fitting components, which can lead to a more meaningful functional relationship between the response variable and the covariates. In addition, we introduce an iterative component combining procedure to aid the interpretability of the model. The results from simulated and real data analyses indicate that our method offers a distinctive edge over some of the existing methods, and that it can generate useful insights on the data set at hand for further investigation.

Advances in Data Analysis and Classification, vol. 18, no. 1, pp. 33–60. Citations: 0
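
The abstract above describes mixture regression, where each observation may follow one of several linear relationships. As a rough illustration only, the sketch below computes the E-step responsibilities for a mixture of simple linear regressions. It assumes Gaussian errors in place of the generalized hyperbolic distribution used in the paper, and the function names and the `(weight, intercept, slope, sd)` tuple layout are hypothetical.

```python
import math


def normal_pdf(y, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def responsibilities(x, y, components):
    """Posterior probability that the point (x, y) belongs to each component.

    components: list of (weight, intercept, slope, sd) tuples (assumed layout).
    Gaussian errors are an assumption made here for simplicity; the paper
    uses generalized hyperbolic components instead.
    """
    weighted = [w * normal_pdf(y, a + b * x, sd) for (w, a, b, sd) in components]
    total = sum(weighted)
    return [d / total for d in weighted]


# Two regression lines with opposite slopes and equal mixing weights.
components = [(0.5, 0.0, 1.0, 1.0), (0.5, 0.0, -1.0, 1.0)]
print(responsibilities(2.0, 2.0, components))
```

A point lying on the first line at x = 2 is assigned almost entirely to the first component, while a point at x = 0, where the two lines cross, is split evenly between them.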

Sparse correspondence analysis for large contingency tables
IF 1.6 | Zone 4, Computer Science | Q2 Statistics & Probability | Pub Date: 2023-01-02 | DOI: 10.1007/s11634-022-00531-5
Ruiping Liu, Ndeye Niang, Gilbert Saporta, Huiwen Wang

We propose sparse variants of correspondence analysis (CA) for large contingency tables such as the document-term matrices used in text mining. By seeking to obtain many zero coefficients, sparse CA remedies the difficulty of interpreting CA results when the table is large. Since CA is a doubly weighted PCA (for rows and columns) or a weighted generalized SVD, we adapt known sparse versions of these methods, with specific developments to obtain orthogonal solutions and to tune the sparseness parameters. We distinguish two cases, depending on whether sparseness is required for both rows and columns or for only one set.

Advances in Data Analysis and Classification, vol. 17, no. 4, pp. 1037–1056. Citations: 2

Proximal methods for sparse optimal scoring and discriminant analysis
IF 1.6 | Zone 4, Computer Science | Q2 Statistics & Probability | Pub Date: 2022-12-21 | DOI: 10.1007/s11634-022-00530-6
Summer Atkins, Gudmundur Einarsson, Line Clemmensen, Brendan Ames

Linear discriminant analysis (LDA) is a classical method for dimensionality reduction, where discriminant vectors are sought to project data to a lower dimensional space for optimal separability of classes. Several recent papers have outlined strategies, based on exploiting sparsity of the discriminant vectors, for performing LDA in the high-dimensional setting where the number of features exceeds the number of observations in the data. However, many of these proposed methods lack scalable methods for solution of the underlying optimization problems. We consider an optimization scheme for solving the sparse optimal scoring formulation of LDA based on block coordinate descent. Each iteration of this algorithm requires an update of a scoring vector, which admits an analytic formula, and an update of the corresponding discriminant vector, which requires solution of a convex subproblem; we will propose several variants of this algorithm where the proximal gradient method or the alternating direction method of multipliers is used to solve this subproblem. We show that the per-iteration cost of these methods scales linearly in the dimension of the data provided restricted regularization terms are employed, and cubically in the dimension of the data in the worst case. Furthermore, we establish that when this block coordinate descent framework generates convergent subsequences of iterates, then these subsequences converge to the stationary points of the sparse optimal scoring problem. We demonstrate the effectiveness of our new methods with empirical results for classification of Gaussian data and data sets drawn from benchmarking repositories, including time-series and multispectral X-ray data, and provide Matlab and R implementations of our optimization schemes.

Advances in Data Analysis and Classification, vol. 17, no. 4, pp. 983–1036. Citations: 2
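
The abstract above mentions solving a convex subproblem with the proximal gradient method. The sketch below shows that method on a generic l1-penalized least-squares problem, minimizing 0.5*||Ax - b||^2 + lam*||x||_1. This is an assumption chosen for illustration, not the paper's exact subproblem or its Matlab/R code, and all function names are hypothetical.

```python
def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1, applied elementwise."""
    return [max(abs(x) - t, 0.0) * (1.0 if x > 0 else -1.0) for x in v]


def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def proximal_gradient_lasso(A, b, lam, step, n_iter=200):
    """Minimize 0.5*||Ax - b||^2 + lam*||x||_1 by proximal gradient descent.

    step should be at most 1/L, where L is the largest eigenvalue of A^T A.
    """
    x = [0.0] * len(A[0])
    At = [list(col) for col in zip(*A)]
    for _ in range(n_iter):
        residual = [p - q for p, q in zip(matvec(A, x), b)]
        grad = matvec(At, residual)  # gradient of the smooth least-squares term
        shifted = [xi - step * g for xi, g in zip(x, grad)]
        x = soft_threshold(shifted, lam * step)  # proximal step for the l1 term
    return x


A = [[1.0, 0.0], [0.0, 1.0]]
b = [3.0, -0.5]
print(proximal_gradient_lasso(A, b, lam=1.0, step=1.0))
```

With an identity matrix the minimizer is simply the soft-thresholded vector b, which makes the example easy to check by hand: the first coordinate shrinks from 3 to 2 and the second is set to zero.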

LASSO regularization within the LocalGLMnet architecture
IF 1.6 | Zone 4, Computer Science | Q2 Statistics & Probability | Pub Date: 2022-12-13 | DOI: 10.1007/s11634-022-00529-z
Ronald Richman, Mario V. Wüthrich

Deep learning models have been very successful in the application of machine learning methods, often outperforming classical statistical models such as linear regression models or generalized linear models. On the other hand, deep learning models are often criticized for neither being explainable nor allowing for variable selection. There are two different ways of dealing with this problem: either we use post-hoc model interpretability methods, or we design specific deep learning architectures that allow for easier interpretation and explanation. This paper builds on our previous work on the LocalGLMnet architecture, which gives an interpretable deep learning architecture. In the present paper, we show how group LASSO regularization (and other regularization schemes) can be implemented within the LocalGLMnet architecture so that we obtain feature sparsity for variable selection. We benchmark our approach against the recently developed LassoNet of Lemhadri et al. (LassoNet: a neural network with feature sparsity. J Mach Learn Res 22:1–29, 2021).

Advances in Data Analysis and Classification, vol. 17, no. 4, pp. 951–981. Citations: 0
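
The abstract above refers to group LASSO regularization as a route to feature sparsity. The sketch below shows the generic group LASSO penalty and its proximal operator (block soft-thresholding), which sets a whole group of weights to zero at once. It is not the LocalGLMnet implementation: how the paper attaches the penalty to the network is not stated in the abstract, and the function names here are hypothetical.

```python
import math


def group_lasso_penalty(groups, lam):
    """lam times the sum of Euclidean norms of the weight groups."""
    return lam * sum(math.sqrt(sum(w * w for w in g)) for g in groups)


def block_soft_threshold(group, t):
    """Proximal operator of t * ||.||_2 for one group of weights.

    Groups whose norm is at most t are zeroed entirely; larger groups
    are shrunk towards zero while keeping their direction.
    """
    norm = math.sqrt(sum(w * w for w in group))
    if norm <= t:
        return [0.0] * len(group)
    scale = 1.0 - t / norm
    return [scale * w for w in group]


groups = [[3.0, 4.0], [0.3, 0.4]]
print(group_lasso_penalty(groups, lam=1.0))
print([block_soft_threshold(g, 1.0) for g in groups])
```

Because the penalty acts on the norm of each group rather than on individual weights, a feature whose weights form one group is either kept or dropped as a unit, which is what makes the scheme usable for variable selection.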

A power-controlled reliability assessment for multi-class probabilistic classifiers
IF 1.6 | Zone 4, Computer Science | Q2 Statistics & Probability | Pub Date: 2022-11-17 | DOI: 10.1007/s11634-022-00528-0
Hyukjun Gweon

In multi-class classification, the output of a probabilistic classifier is a probability distribution of the classes. In this work, we focus on a statistical assessment of the reliability of probabilistic classifiers for multi-class problems. Our approach generates a Pearson $\chi^2$ statistic based on the k-nearest-neighbors in the prediction space. Further, we develop a Bayesian approach for estimating the expected power of the reliability test that can be used for an appropriate sample size k. We propose a sampling algorithm and demonstrate that this algorithm obtains a valid prior distribution. The effectiveness of the proposed reliability test and expected power is evaluated through a simulation study. We also provide illustrative examples of the proposed methods with practical applications.

Advances in Data Analysis and Classification, vol. 17, no. 4, pp. 927–949. Citations: 1
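
The abstract above describes a Pearson chi-squared statistic built from the k nearest neighbors in the prediction space. The sketch below is one plausible reading of that idea, not the paper's exact definition: it takes the k predictions closest to a query point, compares the observed class counts with the counts expected from the predicted probabilities, and sums (O - E)^2 / E over classes. The function name, the Euclidean distance, and the way expected counts are formed are all assumptions.

```python
import math
from collections import Counter


def knn_pearson_chi2(query, predictions, labels, k):
    """Pearson chi-squared statistic over the k predictions nearest to `query`.

    predictions: list of predicted probability vectors, one per observation.
    labels: list of observed class indices, aligned with `predictions`.
    Expected counts are the sums of predicted probabilities over the
    neighbors (an assumption made for this sketch).
    """
    order = sorted(range(len(predictions)),
                   key=lambda i: math.dist(query, predictions[i]))
    neighbors = order[:k]
    observed = Counter(labels[i] for i in neighbors)
    n_classes = len(query)
    expected = [sum(predictions[i][c] for i in neighbors) for c in range(n_classes)]
    return sum((observed[c] - expected[c]) ** 2 / expected[c]
               for c in range(n_classes) if expected[c] > 0)


predictions = [[0.5, 0.5]] * 4
print(knn_pearson_chi2([0.5, 0.5], predictions, [0, 0, 1, 1], k=4))
print(knn_pearson_chi2([0.5, 0.5], predictions, [0, 0, 0, 0], k=4))
```

When the observed labels match the predicted 50/50 split, the statistic is zero; when all four neighbors fall in one class, it grows, signalling that the predicted probabilities are unreliable in that neighborhood.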

A dual subspace parsimonious mixture of matrix normal distributions
IF 1.6 | Zone 4, Computer Science | Q2 Statistics & Probability | Pub Date: 2022-11-16 | DOI: 10.1007/s11634-022-00526-2
Alex Sharp, Glen Chalatov, Ryan P. Browne

We present a parsimonious dual-subspace clustering approach for a mixture of matrix-normal distributions. By assuming certain principal components of the row and column covariance matrices are equally important, we express the model in fewer parameters without sacrificing discriminatory information. We derive update rules for an ECM algorithm and set forth necessary conditions to ensure identifiability. We use simulation to demonstrate parameter recovery, and we illustrate the parsimony and competitive performance of the model through two data analyses.

Advances in Data Analysis and Classification, vol. 17, no. 3, pp. 801–822. Citations: 1

Monitoring photochemical pollutants based on symbolic interval-valued data analysis
IF 1.6 | Zone 4, Computer Science | Q2 Statistics & Probability | Pub Date: 2022-11-12 | DOI: 10.1007/s11634-022-00527-1
Liang-Ching Lin, Meihui Guo, Sangyeol Lee

This study considers monitoring photochemical pollutants for anomaly detection based on symbolic interval-valued data analysis. For this task, we construct control charts based on the principal component scores of symbolic interval-valued data. Herein, the symbolic interval-valued data are assumed to follow a normal distribution, and an approximate expectation formula of order statistics from the normal distribution is used in the univariate case to estimate the mean and variance via the method of moments. In addition, we consider the bivariate case wherein we use the maximum likelihood estimator calculated from the likelihood function derived under a bivariate copula. We also establish the procedures for the statistical control chart based on the univariate and bivariate interval-valued variables, and the procedures are potentially extendable to higher dimensional cases. Monte Carlo simulations and real data analysis using photochemical pollutants confirm the validity of the proposed method. The results particularly show the superiority over the conventional method that uses the averages to identify the date on which the abnormal maximum occurred.

Advances in Data Analysis and Classification, vol. 17, no. 4, pp. 897–926. Citations: 0

Editorial for ADAC issue 4 of volume 16 (2022)
IF 1.6 | Zone 4, Computer Science | Q2 Statistics & Probability | Pub Date: 2022-10-31 | DOI: 10.1007/s11634-022-00525-3
Maurizio Vichi, Andrea Ceroli, Hans A. Kestler, Akinori Okada, Claus Weihs
Advances in Data Analysis and Classification, vol. 16, no. 4, pp. 817–821. Citations: 0

Attraction-repulsion clustering: a way of promoting diversity linked to demographic parity in fair clustering
IF 1.6 | Zone 4, Computer Science | Q2 Statistics & Probability | Pub Date: 2022-10-20 | DOI: 10.1007/s11634-022-00516-4
Eustasio del Barrio, Hristo Inouzhe, Jean-Michel Loubes

We consider the problem of diversity enhancing clustering, i.e., developing clustering methods which produce clusters that favour diversity with respect to a set of protected attributes such as race, sex, age, etc. In the context of fair clustering, diversity plays a major role when fairness is understood as demographic parity. To promote diversity, we introduce perturbations to the distance in the unprotected attributes that account for protected attributes in a way that resembles attraction-repulsion of charged particles in physics. These perturbations are defined through dissimilarities with a tractable interpretation. Cluster analysis based on attraction-repulsion dissimilarities penalizes homogeneity of the clusters with respect to the protected attributes and leads to an improvement in diversity. An advantage of our approach, which falls into a pre-processing set-up, is its compatibility with a wide variety of clustering methods and with non-Euclidean data. We illustrate the use of our procedures with both synthetic and real data and provide discussion about the relation between diversity, fairness, and cluster structure.

Advances in Data Analysis and Classification, vol. 17, no. 4, pp. 859–896. Open access PDF: https://link.springer.com/content/pdf/10.1007/s11634-022-00516-4.pdf. Citations: 0
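
The abstract above describes perturbing the distance between unprotected attributes according to the protected attributes, so that similar individuals repel and dissimilar ones attract. The sketch below illustrates that idea with a hypothetical multiplicative perturbation. The paper defines its own family of dissimilarities, which the abstract does not spell out, so the formula, the parameter `u`, and the function name are assumptions made purely for illustration.

```python
import math


def attraction_repulsion_dissimilarity(x1, s1, x2, s2, u=0.5):
    """Hypothetical perturbed distance between two individuals.

    x1, x2: unprotected attribute vectors.
    s1, s2: protected attribute values (e.g. group labels).
    u: perturbation strength, assumed to satisfy 0 <= u < 1.
    Pairs sharing a protected value are pushed apart (factor 1 + u);
    pairs with different values are pulled together (factor 1 - u).
    """
    base = math.dist(x1, x2)
    factor = 1.0 + u if s1 == s2 else 1.0 - u
    return factor * base


print(attraction_repulsion_dissimilarity((0.0, 0.0), "a", (3.0, 4.0), "a"))
print(attraction_repulsion_dissimilarity((0.0, 0.0), "a", (3.0, 4.0), "b"))
```

A matrix of such values can be passed to any clustering method that accepts precomputed dissimilarities, which matches the pre-processing set-up mentioned in the abstract. The result is a dissimilarity rather than a metric, since the triangle inequality is not guaranteed to hold.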

A structured covariance ensemble for sufficient dimension reduction
IF 1.6 | Zone 4, Computer Science | Q2 Statistics & Probability | Pub Date: 2022-10-19 | DOI: 10.1007/s11634-022-00524-4
Qin Wang, Yuan Xue

Sufficient dimension reduction (SDR) is a useful tool for high-dimensional data analysis. SDR aims at reducing the data dimensionality without loss of regression information between the response and its high-dimensional predictors. Many existing SDR methods are designed for data with continuous responses. Motivated by recent work on aggregate dimension reduction (Wang in Stat Sin 30:1027–1048, 2020), we propose a unified SDR framework for both continuous and binary responses through a structured covariance ensemble. The connection with existing approaches is discussed in detail and an efficient algorithm is proposed. Numerical examples and a real data application demonstrate its satisfactory performance.

Advances in Data Analysis and Classification, vol. 17, no. 3, pp. 777–800. Citations: 1