Advances in Data Analysis and Classification: Latest Articles

A new model for counterfactual analysis for functional data
IF 1.4; Zone 4 (Computer Science); Q2 STATISTICS & PROBABILITY; Pub Date: 2023-10-25; DOI: 10.1007/s11634-023-00563-5
Emilio Carrizosa, Jasone Ramírez-Ayerbe, Dolores Romero Morales

Counterfactual explanations have become a very popular interpretability tool for understanding and explaining how complex machine learning models make decisions for individual instances. Most research on counterfactual explainability focuses on tabular and image data, and much less on models dealing with functional data. This paper addresses counterfactual analysis for functional data, in which the goal is to identify the samples of the dataset from which the counterfactual explanation is built, as well as how they are combined, so that the individual instance and its counterfactual are as close as possible. Our methodology can be used with different distance measures for multivariate functional data and is applicable to any score-based classifier. We illustrate our methodology using two real-world datasets, one univariate and one multivariate.
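
To make the idea of building a counterfactual from dataset samples concrete, the sketch below searches over convex combinations of two curves and keeps the one closest to the instance whose score reaches a threshold. It is an illustration only: the abstract does not state the optimization model, so the brute-force pair search, the root-mean-square distance, the function name `counterfactual_from_pairs`, and the parameters `threshold` and `n_grid` are all assumptions made here, not the authors' method.

```python
import itertools
import numpy as np


def counterfactual_from_pairs(x0, prototypes, score, threshold=0.0, n_grid=21):
    """Closest convex combination of two prototypes whose score reaches threshold.

    x0 and each prototype are curves sampled on a common grid (1-D arrays).
    Returns (distance, (i, j), weight_on_i, curve), or None if no candidate qualifies.
    Hypothetical helper for illustration; not the model from the paper.
    """
    best = None
    weights = np.linspace(0.0, 1.0, n_grid)
    for i, j in itertools.combinations(range(len(prototypes)), 2):
        for w in weights:
            candidate = w * prototypes[i] + (1.0 - w) * prototypes[j]
            if score(candidate) < threshold:
                continue  # the classifier would not assign the desired class
            # Root-mean-square distance between the sampled curves (an assumed choice).
            distance = np.sqrt(np.mean((candidate - x0) ** 2))
            if best is None or distance < best[0]:
                best = (distance, (i, j), w, candidate)
    return best
```

Any callable that maps a sampled curve to a real-valued score can be passed as `score`, which mirrors the abstract's statement that the approach applies to any score-based classifier. The returned indices and weight show which samples the counterfactual is made of and how they are combined.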

Advances in Data Analysis and Classification, vol. 18, no. 4, pp. 981–1000. PDF: https://link.springer.com/content/pdf/10.1007/s11634-023-00563-5.pdf
Citations: 0
Profile-based latent class distance association analyses for sparse tables: application to the attitude of European citizens towards sustainable tourism
IF 1.4; Zone 4 (Computer Science); Q2 STATISTICS & PROBABILITY; Pub Date: 2023-10-18; DOI: 10.1007/s11634-023-00559-1
Francesca Bassi, José Fernando Vera, Juan Antonio Marmolejo Martín

Social and behavioural sciences often deal with the analysis of associations in cross-classified data. This paper studies the patterns observed among European citizens in their attitude towards sustainable tourism, specifically their willingness to change travel and tourism habits to be more sustainable. The data record the intention to comply with nine sustainable actions; the answers to these questions generate individual profiles, and each respondent's country is also reported. Therefore, unlike a variable-oriented approach, here we are interested in a person-oriented approach through profiles. Some traditional methods perform poorly with profiles, for example because of the sparseness of the contingency table. We removed many of these limitations by using a latent class distance association model, clustering the row profiles into classes and representing these together with the categories of the response variable in a low-dimensional space. We showed, furthermore, that associations between cluster centres and categories of a response variable can be interpreted intuitively within this framework using unfolding. The analyses showed that the citizens most committed to environmentally friendly behaviour live in Sweden and Romania, while citizens less willing to change their habits towards more sustainable behaviour live in Belgium, Cyprus, France, Lithuania and the Netherlands. Citizens' preparedness to change habits, however, also depends on socio-demographic characteristics such as gender, age, occupation, type of community of residence, household size, and frequency of travel before the Covid-19 pandemic.

Advances in Data Analysis and Classification, vol. 18, no. 4, pp. 953–980. PDF: https://link.springer.com/content/pdf/10.1007/s11634-023-00559-1.pdf
Citations: 0
Editorial for ADAC issue 4 of volume 17 (2023)
IF 1.6; Zone 4 (Computer Science); Q2 STATISTICS & PROBABILITY; Pub Date: 2023-10-14; DOI: 10.1007/s11634-023-00564-4
Maurizio Vichi, Andrea Cerioli, Hans A. Kestler, Akinori Okada, Claus Weihs
Advances in Data Analysis and Classification, vol. 17, no. 4, pp. 823–827.
Citations: 0
Discovering interpretable structure in longitudinal predictors via coefficient trees
IF 1.4; Zone 4 (Computer Science); Q2 STATISTICS & PROBABILITY; Pub Date: 2023-10-11; DOI: 10.1007/s11634-023-00562-6
Özge Sürer, Daniel W. Apley, Edward C. Malthouse

We consider the regression setting in which the response variable is not longitudinal (i.e., it is observed once for each case) but is assumed to depend functionally on a set of predictors that are observed longitudinally, a specific form of functional predictors. In this setting, we often expect observations of the same predictor at nearby time points to be associated with the response in the same way. We can exploit this and discover groups of predictors that share the same (or similar) coefficient according to their temporal proximity. We propose a new algorithm, coefficient tree regression, for data in which a non-longitudinal response depends on longitudinal predictors, to efficiently discover the underlying temporal characteristics of the data. The approach yields a simple and highly interpretable tree structure that exposes the hierarchical relationships between groups of predictors that affect the response in a similar manner based on their temporal proximity, and we demonstrate with a real example that it can provide a clear and concise interpretation of the data. In numerical comparisons over a variety of examples, we show that our approach achieves substantially better predictive accuracy than existing competitors, most likely because of the dimensionality reduction that is discovered automatically when fitting the model, in addition to its interpretability advantages and lower computational expense.
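
The coefficient-sharing idea can be illustrated with a small sketch. When contiguous time points share one coefficient, the linear model reduces to a regression on the sum of the predictor over each block, which is the dimensionality reduction the abstract mentions. The sketch assumes the blocks are already known; discovering them is the job of the coefficient tree algorithm, which is not reproduced here. The function name `fit_grouped_coefficients` and the `groups` argument are hypothetical.

```python
import numpy as np


def fit_grouped_coefficients(X, y, groups):
    """Least squares with one shared coefficient per contiguous block of time points.

    X: (n_cases, n_times) array, one predictor observed at n_times time points.
    groups: list of (start, stop) half-open index ranges, assumed known here.
    Returns (intercept, array with one coefficient per group).
    """
    # Time points that share a coefficient collapse into a single summed column.
    Z = np.column_stack([X[:, start:stop].sum(axis=1) for start, stop in groups])
    design = np.column_stack([np.ones(len(y)), Z])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[0], beta[1:]
```

With six time points split into two blocks, the fit estimates two slopes instead of six, and each slope reads directly as the effect of the predictor over that time window.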

Advances in Data Analysis and Classification, vol. 18, no. 4, pp. 911–951.
Citations: 0
Generalized Cramér’s coefficient via f-divergence for contingency tables
IF 1.4; Zone 4 (Computer Science); Q2 STATISTICS & PROBABILITY; Pub Date: 2023-10-05; DOI: 10.1007/s11634-023-00560-8
Wataru Urasaki, Tomoyuki Nakagawa, Tomotaka Momozaki, Sadao Tomizawa

Various measures have been proposed in two-way contingency table analysis to express the strength of association between row and column variables. Tomizawa et al. (2004) proposed more general measures, including Cramér’s coefficient, using the power-divergence. In this paper, we propose measures using the f-divergence, which forms a wider class than the power-divergence. Unlike statistical hypothesis tests, these measures quantify the association structure in contingency tables. The contribution of our study is a proof that a measure applying a function that satisfies the condition of the f-divergence has desirable properties for measuring the strength of association in contingency tables. With this result, one can easily construct a new measure from a divergence that has the properties the analyst considers essential. As an example, we conducted numerical experiments with a measure applying the θ-divergence. Furthermore, the proposed measures allow further interpretation of the association between the row and column variables, which could not be obtained with the conventional measure. We also show a relationship between our proposed measures and the correlation coefficient in a bivariate normal distribution of latent variables in the contingency tables.
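
As a point of reference, the classical Cramér’s coefficient can itself be written as an f-divergence between the observed joint proportions p_ij and the independence model q_ij = p_i. p_.j, using the Pearson choice f(x) = (x - 1)^2. The sketch below shows only that classical special case; the normalization the paper uses for general f is not given in the abstract and is not guessed at here. The function names are hypothetical, and all row and column totals are assumed to be positive.

```python
import numpy as np


def f_divergence_from_independence(table, f):
    """Sum over cells of q_ij * f(p_ij / q_ij), with q_ij the product of the margins."""
    counts = np.asarray(table, dtype=float)
    p = counts / counts.sum()                      # joint proportions
    q = np.outer(p.sum(axis=1), p.sum(axis=0))     # independence model
    return float((q * f(p / q)).sum())


def cramers_v(table):
    """Classical Cramér's V, obtained from the Pearson choice f(x) = (x - 1)^2."""
    r, c = np.asarray(table).shape
    # With this f the divergence equals phi^2 = chi^2 / n.
    phi2 = f_divergence_from_independence(table, lambda x: (x - 1.0) ** 2)
    return float(np.sqrt(phi2 / min(r - 1, c - 1)))
```

Passing a different convex function `f` with f(1) = 0 to `f_divergence_from_independence` gives the raw divergence for that choice, which is the quantity a generalized measure would then need to normalize.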

Advances in Data Analysis and Classification, vol. 18, no. 4, pp. 893–910. PDF: https://link.springer.com/content/pdf/10.1007/s11634-023-00560-8.pdf
Citations: 0
Mixture modeling with normalizing flows for spherical density estimation
IF 1.4; Zone 4 (Computer Science); Q2 STATISTICS & PROBABILITY; Pub Date: 2023-10-04; DOI: 10.1007/s11634-023-00561-7
Tin Lok James Ng, Andrew Zammit-Mangion

Normalizing flows are objects used for modeling complicated probability density functions, and have attracted considerable interest in recent years. Many flexible families of normalizing flows have been developed. However, the focus to date has largely been on normalizing flows on Euclidean domains; while normalizing flows have been developed for spherical and other non-Euclidean domains, these are generally less flexible than their Euclidean counterparts. To address this shortcoming, in this work we introduce a mixture-of-normalizing-flows model to construct complicated probability density functions on the sphere. This model provides a flexible alternative to existing parametric, semiparametric, and nonparametric finite mixture models. Model estimation is performed using the expectation maximization algorithm and a variant thereof. The model is applied to simulated data, where the benefit over the conventional (single component) normalizing flow is verified. The model is then applied to two real-world data sets of events occurring on the surface of the Earth, the first relating to earthquakes and the second to terrorist activity. In both cases, we see that the mixture-of-normalizing-flows model yields a good representation of the density of event occurrence.

Advances in Data Analysis and Classification, vol. 18, no. 1, pp. 103–120.
Citations: 0
Parsimony and parameter estimation for mixtures of multivariate leptokurtic-normal distributions
IF 1.4; Zone 4 (Computer Science); Q2 STATISTICS & PROBABILITY; Pub Date: 2023-09-27; DOI: 10.1007/s11634-023-00558-2
Ryan P. Browne, Luca Bagnato, Antonio Punzo

Mixtures of multivariate leptokurtic-normal distributions have been recently introduced in the clustering literature based on mixtures of elliptical heavy-tailed distributions. They have the advantage of having parameters directly related to the moments of practical interest. We derive two estimation procedures for these mixtures. The first one is based on the majorization-minimization algorithm, while the second is based on a fixed point approximation. Moreover, we introduce parsimonious forms of the considered mixtures and we use the illustrated estimation procedures to fit them. We use simulated and real data sets to investigate various aspects of the proposed models and algorithms.

Advances in Data Analysis and Classification, vol. 18, no. 3, pp. 597–625. PDF: https://link.springer.com/content/pdf/10.1007/s11634-023-00558-2.pdf
Citations: 0
Theory of angular depth for classification of directional data
IF 1.4; Zone 4 (Computer Science); Q2 STATISTICS & PROBABILITY; Pub Date: 2023-09-23; DOI: 10.1007/s11634-023-00557-3
Stanislav Nagy, Houyem Demni, Davide Buttarazzi, Giovanni C. Porzio

Depth functions offer an array of tools that enable the introduction of quantile- and ranking-like approaches to multivariate and non-Euclidean datasets. We investigate the potential of using depths in the problem of nonparametric supervised classification of directional data, that is, classification of data that naturally live on the unit sphere of a Euclidean space. In this paper, we address the problem mainly from a theoretical perspective, with the final goal of offering guidelines on which angular depth function should be adopted in classifying directional data. A set of desirable properties of an angular depth is put forward. With respect to these properties, we compare and contrast the most widely used angular depth functions. Simulated and real data are then used to showcase the main implications of the discussed theoretical results, with an emphasis on the potential and limits of the often disregarded angular halfspace depth.
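
A depth-based classifier for directional data can be sketched with one of the simpler angular depths. The example below uses the cosine distance depth, commonly defined as 2 minus the expected value of 1 - x'X, and a maximum-depth rule that assigns a point to the class in which it is deepest. This choice is an assumption made for brevity: the abstract emphasizes the angular halfspace depth, which is harder to compute and is not implemented here, and the function names are hypothetical.

```python
import numpy as np


def cosine_distance_depth(x, sample):
    """Empirical cosine distance depth: 2 minus the mean of (1 - x . X_i)."""
    x = np.asarray(x, dtype=float)
    sample = np.asarray(sample, dtype=float)
    x = x / np.linalg.norm(x)                                    # project onto the unit sphere
    sample = sample / np.linalg.norm(sample, axis=1, keepdims=True)
    return float(2.0 - np.mean(1.0 - sample @ x))


def max_depth_classify(x, samples_by_class):
    """Assign x to the class whose training sample gives it the largest depth."""
    depths = {label: cosine_distance_depth(x, s) for label, s in samples_by_class.items()}
    return max(depths, key=depths.get)
```

Here `samples_by_class` maps each class label to an array of training directions, one per row. Swapping in another angular depth only requires replacing `cosine_distance_depth`; the maximum-depth rule stays the same.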

Advances in Data Analysis and Classification, vol. 18, no. 3, pp. 627–662.
Citations: 0
Co-clustering contaminated data: a robust model-based approach
IF 1.4; Zone 4 (Computer Science); Q2 STATISTICS & PROBABILITY; Pub Date: 2023-09-22; DOI: 10.1007/s11634-023-00549-3
Edoardo Fibbi, Domenico Perrotta, Francesca Torti, Stefan Van Aelst, Tim Verdonck

The exploration and analysis of large high-dimensional data sets call for well-thought-out techniques, such as co-clustering, to extract the salient information from the data. Latent block models cast co-clustering in a probabilistic framework that extends finite mixture models to the two-way setting. Real-world data sets often contain anomalies, which may be of interest per se and may make the results of standard, non-robust procedures unreliable. Estimation of latent block models can also be heavily affected by contaminated data. We propose an algorithm to compute robust estimates for latent block models. Experiments on both simulated and real data show that our method is able to resist high levels of contamination and can provide additional insight into the data by highlighting possible anomalies.

Advances in Data Analysis and Classification, vol. 18, no. 1, pp. 121–161. PDF: https://link.springer.com/content/pdf/10.1007/s11634-023-00549-3.pdf
Citations: 0
Contamination transformation matrix mixture modeling for skewed data groups with heavy tails and scatter
IF 1.4; Zone 4 (Computer Science); Q2 STATISTICS & PROBABILITY; Pub Date: 2023-09-13; DOI: 10.1007/s11634-023-00550-w
Xuwen Zhu, Yana Melnykov, Angelina S. Kolomoytseva

Model-based clustering is a popular application of the rapidly developing area of finite mixture modeling. While there is ample work on clustering multivariate data, a growing number of advances aim to extend existing theory to the matrix-variate framework. Matrix-variate Gaussian mixtures are the most popular choice in this setting, despite their potential misfit for skewed and heavy-tailed data. To overcome this lack of flexibility, a new contaminated transformation matrix mixture model is proposed. We illustrate its utility in a series of experiments on simulated data and apply it to a real-life data set containing COVID-related information. The performance of the developed model is promising in all considered settings.

Contamination transformation matrix mixture modeling for skewed data groups with heavy tails and scatter. Xuwen Zhu, Yana Melnykov, Angelina S. Kolomoytseva. Advances in Data Analysis and Classification, vol. 18, no. 1, pp. 85-101. Published 2023-09-13. DOI: 10.1007/s11634-023-00550-w
Citations: 0