
Advances in Data Analysis and Classification: Latest Articles

Determinantal consensus clustering
IF 1.6 | Zone 4, Computer Science | Q2 STATISTICS & PROBABILITY | Pub Date: 2022-08-25 | DOI: 10.1007/s11634-022-00514-6
Serge Vicente, Alejandro Murua-Sazo

Random restart of a given algorithm produces many partitions that can be aggregated to yield a consensus clustering. Ensemble methods have been recognized as more robust approaches for data clustering than single clustering algorithms. We propose the use of determinantal point processes or DPPs for the random restart of clustering algorithms based on initial sets of center points, such as k-medoids or k-means. The relation between DPPs and kernel-based methods makes DPPs suitable to describe and quantify similarity between objects. DPPs favor diversity of the center points in initial sets, so that sets with similar points have less chance of being generated than sets with very distinct points. Most current initial sets are generated with center points sampled uniformly at random. We show through extensive simulations that, contrary to DPPs, this technique fails both to ensure diversity and to obtain good coverage of all data facets. These are two key properties that make DPPs achieve good performance. Simulations with artificial datasets and applications to real datasets show that determinantal consensus clustering outperforms consensus clusterings which are based on uniform random sampling of center points.
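
The diversity property described in this abstract can be illustrated with a small sketch. It assumes an L-ensemble DPP, in which the probability of selecting a set of points is proportional to the determinant of the corresponding kernel submatrix. The Gaussian kernel, the bandwidth, the toy points, and the function names `rbf_kernel` and `dpp_weight` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def rbf_kernel(points, bandwidth=1.0):
    """Gaussian similarity matrix: L[i, j] = exp(-||x_i - x_j||^2 / (2 * bandwidth^2))."""
    diffs = points[:, None, :] - points[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def dpp_weight(kernel, subset):
    """Unnormalized L-ensemble probability of `subset`: determinant of the kernel submatrix."""
    idx = np.asarray(subset)
    return float(np.linalg.det(kernel[np.ix_(idx, idx)]))

# Toy data (hypothetical): two tight groups of points on the real line.
points = np.array([[0.0], [0.1], [5.0], [5.1]])
L = rbf_kernel(points)

similar_centers = dpp_weight(L, [0, 1])   # both candidate centers from the same group
distinct_centers = dpp_weight(L, [0, 2])  # one candidate center from each group
print(similar_centers, distinct_centers)
```

Under these assumptions the pair of nearly identical centers receives a much smaller weight than the pair of well-separated centers, whereas uniform sampling would treat both pairs as equally likely.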

Advances in Data Analysis and Classification, vol. 17, no. 4, pp. 829-858. Journal Article.
Citations: 1
Sequential classification of customer behavior based on sequence-to-sequence learning with gated-attention neural networks
IF 1.6 | Zone 4, Computer Science | Q2 STATISTICS & PROBABILITY | Pub Date: 2022-08-24 | DOI: 10.1007/s11634-022-00517-3
Licheng Zhao, Yi Zuo, Katsutoshi Yada

During the last decade, an increasing number of supermarkets have begun to use RFID technology to track consumers' in-store movements to collect data on their shopping behavior. Marketers hope that such new types of RFID data will improve the accuracy of the existing customer segmentation, and provide effective marketing positioning from the customer’s perspective. Therefore, this paper presents an integrated work on combining RFID data with traditional point of sales (POS) data, and proposes a sequential classification-based model to classify and identify consumers’ purchasing behavior. We chose an island area of the supermarket to perform the tracking experiment and collected customer behavioral data for two months. RFID data are used to extract behavior explanatory variables, such as residence time and wandering direction. For these customers, we extracted their purchasing historical data for the past three months from the POS system to define customer background and segmentation. Finally, this paper proposes a novel classification model based on sequence-to-sequence (Seq2seq) learning architecture. The encoder–decoder of Seq2seq uses an attention mechanism to pursue sequential inputs, with gating units in the encoder and decoder adjusting the output weights based on the input variables. The experimental results showed that the proposed model has a higher accuracy and area under the curve (AUC) value for customer classification and recognition compared with other benchmark models. Furthermore, the validity of behavioral description variables among heterogeneous customers was verified by adjusting the attention mechanism.

Advances in Data Analysis and Classification, vol. 17, no. 3, pp. 549-581. Journal Article.
Citations: 0
Identification of representative trees in random forests based on a new tree-based distance measure
IF 1.6 | Zone 4, Computer Science | Q2 STATISTICS & PROBABILITY | Pub Date: 2022-08-19 | DOI: 10.1101/2022.05.15.492004
Björn-Hergen Laabs von Holt, A. Westenberger, I. König
In life sciences, random forests are often used to train predictive models. However, gaining any explanatory insight into the mechanics leading to a specific outcome is rather complex, which impedes the implementation of random forests into clinical practice. By simplifying a complex ensemble of decision trees to a single most representative tree, it is assumed to be possible to observe common tree structures, the importance of specific features and variable interactions. Thus, representative trees could also help to understand interactions between genetic variants. Intuitively, representative trees are those with the minimal distance to all other trees, which requires a proper definition of the distance between two trees. Thus, we developed a new tree-based distance measure, which incorporates more of the underlying tree structure than other metrics. We compared our new method with the existing metrics in an extensive simulation study and applied it to predict the age at onset based on a set of genetic risk factors in a clinical data set. In our simulation study we were able to show the advantages of our weighted splitting variable approach. Our real data application revealed that representative trees are not only able to replicate the results from a recent genome-wide association study, but also can give additional explanations of the genetic mechanisms. Finally, we implemented all compared distance measures in R and made them publicly available in the R package timbR ( https://github.com/imbs-hl/timbR ).
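
The selection rule stated in this abstract (the representative tree is the one with minimal distance to all other trees) can be sketched as follows. The pairwise distance matrix is a hypothetical placeholder: it is not computed with the paper's distance measure or with the timbR package, and the function name `most_representative` is an assumption for illustration.

```python
import numpy as np

def most_representative(distances):
    """Return the index of the item with the smallest summed distance to all others."""
    distances = np.asarray(distances, dtype=float)
    return int(np.argmin(distances.sum(axis=1)))

# Hypothetical symmetric distances between three trees of a forest.
tree_distances = np.array([
    [0.0, 0.2, 0.9],
    [0.2, 0.0, 0.4],
    [0.9, 0.4, 0.0],
])

representative = most_representative(tree_distances)
print(representative)
```

Any tree distance could be plugged in here; the choice of distance determines which structural aspects of the trees the selected representative reflects.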
Advances in Data Analysis and Classification, vol. 30, no. 5, pp. 1-18. Journal Article. https://doi.org/10.1101/2022.05.15.492004
Citations: 0
Localization processes for functional data analysis
IF 1.6 | Zone 4, Computer Science | Q2 STATISTICS & PROBABILITY | Pub Date: 2022-08-19 | DOI: 10.1007/s11634-022-00512-8
Antonio Elías, Raúl Jiménez, J. E. Yukich

We propose an alternative to k-nearest neighbors for functional data whereby the approximating neighboring curves are piecewise functions built from a functional sample. Using a locally defined distance function that satisfies stabilization criteria, we establish pointwise and global approximation results in function spaces when the number of data curves is large. We exploit this feature to develop the asymptotic theory when a finite number of curves is observed at time-points given by an i.i.d. sample whose cardinality increases up to infinity. We use these results to investigate the problem of estimating unobserved segments of a partially observed functional data sample as well as to study the problem of functional classification and outlier detection. For such problems our methods are competitive with and sometimes superior to benchmark predictions in the field. The R package localFDA provides routines for computing the localization processes and the estimators proposed in this article.

Advances in Data Analysis and Classification, vol. 17, no. 2, pp. 485-517. Journal Article.
Citations: 0
Editorial for ADAC issue 3 of volume 16 (2022)
IF 1.6 | Zone 4, Computer Science | Q2 STATISTICS & PROBABILITY | Pub Date: 2022-08-16 | DOI: 10.1007/s11634-022-00511-9
Maurizio Vichi, Andrea Ceroli, Hans A. Kestler, Akinori Okada, Claus Weihs
Advances in Data Analysis and Classification, vol. 16, no. 3, pp. 487-490. Journal Article.
Citations: 0
Model based clustering of multinomial count data
IF 1.6 | Zone 4, Computer Science | Q2 STATISTICS & PROBABILITY | Pub Date: 2022-07-28 | DOI: 10.1007/s11634-023-00547-5
Panagiotis Papastamoulis
Advances in Data Analysis and Classification, vol. 50, no. 1. Journal Article. https://doi.org/10.1007/s11634-023-00547-5
Citations: 0
Mixed-effect models with trees
IF 1.6 | Zone 4, Computer Science | Q2 STATISTICS & PROBABILITY | Pub Date: 2022-07-08 | DOI: 10.1007/s11634-022-00509-3
Anna Gottard, Giulia Vannucci, Leonardo Grilli, Carla Rampichini

Tree-based regression models are a class of statistical models for predicting continuous response variables when the shape of the regression function is unknown. They naturally take into account both non-linearities and interactions. However, they struggle with linear and quasi-linear effects and assume iid data. This article proposes two new algorithms for jointly estimating an interpretable predictive mixed-effect model with two components: a linear part, capturing the main effects, and a non-parametric component consisting of three trees for capturing non-linearities and interactions among individual-level predictors, among cluster-level predictors or cross-level. The first proposed algorithm focuses on prediction. The second one is an extension which implements a post-selection inference strategy to provide valid inference. The performance of the two algorithms is validated via Monte Carlo studies. An application on INVALSI data illustrates the potential of the proposed approach.

Advances in Data Analysis and Classification, vol. 17, no. 2, pp. 431-461. Journal Article. Open access PDF: https://link.springer.com/content/pdf/10.1007/s11634-022-00509-3.pdf
Citations: 0
On mathematical optimization for clustering categories in contingency tables
IF 1.6 | Zone 4, Computer Science | Q2 STATISTICS & PROBABILITY | Pub Date: 2022-06-28 | DOI: 10.1007/s11634-022-00508-4
Emilio Carrizosa, Vanesa Guerrero, Dolores Romero Morales

Many applications in data analysis study whether two categorical variables are independent using a function of the entries of their contingency table. Often, the categories of the variables, associated with the rows and columns of the table, are grouped, yielding a less granular representation of the categorical variables. The purpose of this is to attain reasonable sample sizes in the cells of the table and, more importantly, to incorporate expert knowledge on the allowable groupings. However, it is known that the conclusions on independence depend, in general, on the chosen granularity, as in Simpson's paradox. In this paper we propose a methodology that, for a given contingency table and a fixed granularity, finds a clustered table with the highest $\chi^2$ statistic. Repeating this procedure for different values of the granularity, we can either identify an extreme grouping, namely the largest granularity for which the statistical dependence is still detected, or conclude that it does not exist and that the two variables are dependent regardless of the size of the clustered table. For this problem, we propose an assignment mathematical formulation and a set partitioning one. Our approach is flexible enough to include constraints on the desirable structure of the clusters, such as must-link or cannot-link constraints on the categories that can, or cannot, be merged together, and to ensure reasonable sample sizes in the cells of the clustered table, from which trustworthy statistical conclusions can be derived. We illustrate the usefulness of our methodology using a dataset of a medical study.
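
The objective described in this abstract (for a fixed granularity, find the grouping of categories whose merged table has the highest $\chi^2$ statistic) can be sketched on a toy table. This is a brute-force enumeration over row groupings only, used as a stand-in for illustration; it is not the assignment or set-partitioning formulation proposed in the paper, and the table values and function names are hypothetical.

```python
from itertools import product

import numpy as np

def chi2_statistic(table):
    """Pearson chi-squared statistic of a contingency table with positive margins."""
    table = np.asarray(table, dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return float(((table - expected) ** 2 / expected).sum())

def merge_rows(table, labels, n_groups):
    """Sum the rows of `table` that share the same group label."""
    table = np.asarray(table, dtype=float)
    merged = np.zeros((n_groups, table.shape[1]))
    for row, label in zip(table, labels):
        merged[label] += row
    return merged

def best_row_grouping(table, n_groups):
    """Enumerate every assignment of rows to non-empty groups and keep the best one."""
    best_labels, best_value = None, -np.inf
    for labels in product(range(n_groups), repeat=len(table)):
        if len(set(labels)) < n_groups:
            continue  # skip assignments that leave a group empty
        value = chi2_statistic(merge_rows(table, labels, n_groups))
        if value > best_value:
            best_labels, best_value = labels, value
    return best_labels, best_value

# Hypothetical 4 x 2 table: rows 0-1 and rows 2-3 have similar column profiles.
table = np.array([
    [30, 10],
    [28, 12],
    [10, 30],
    [12, 28],
])
labels, value = best_row_grouping(table, n_groups=2)
print(labels, value)
```

The enumeration grows exponentially with the number of categories, which is one reason a mathematical optimization formulation is attractive for tables of realistic size. Must-link or cannot-link constraints could be added to this sketch as extra filters inside the loop.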

Advances in Data Analysis and Classification, vol. 17, no. 2, pp. 407-429. Journal Article. Open access PDF: https://link.springer.com/content/pdf/10.1007/s11634-022-00508-4.pdf
Citations: 1
Classification based on multivariate mixed type longitudinal data with an application to the EU-SILC database
IF 1.6 | Zone 4, Computer Science | Q2 STATISTICS & PROBABILITY | Pub Date: 2022-06-25 | DOI: 10.1007/s11634-022-00504-8
Jan Vávra, Arnošt Komárek

Although many present-day studies gather data of a diverse nature (numeric quantities, binary indicators or ordered categories) on the same units repeatedly over time, there exist only a limited number of approaches in the literature to analyse so-called mixed-type longitudinal data. We present a statistical model capable of jointly modelling several mixed-type outcomes, which also accounts for possible dependencies among the investigated outcomes. A thresholding approach to link binary or ordinal variables to their latent numeric counterparts allows us to jointly model all, including latent, numeric outcomes using a multivariate version of the linear mixed-effects model. We avoid the independence assumption over outcomes by relaxing the variance matrix of random effects to a completely general positive definite matrix. Moreover, we follow model-based clustering methodology to create a mixture of such models to model heterogeneity in the temporal evolution of the considered outcomes. The estimation of such a hierarchical model is approached by Bayesian principles with the use of Markov chain Monte Carlo methods. After a successful simulation study with the aim to examine the ability to consistently estimate the true parameter values and thus discover the different patterns, the EU-SILC dataset consisting of Czech households that were followed for 4 years in a time span from 2005 to 2016 was analysed. The households were classified into groups with a similar evolution of several closely related indicators of monetary poverty based on estimated classification probabilities.
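
The thresholding idea mentioned in this abstract (a binary or ordinal outcome is observed as the interval into which a latent numeric value falls) can be sketched as follows. The cutpoints, the latent values, and the function name `threshold_latent` are illustrative assumptions; the paper's model, priors, and estimation procedure are not reproduced here.

```python
import numpy as np

def threshold_latent(latent, cutpoints):
    """Map latent numeric values to ordinal categories 0..len(cutpoints).

    A value falls in category k when it lies between the (k-1)-th and k-th
    cutpoint; cutpoints must be sorted in increasing order.
    """
    return np.digitize(latent, cutpoints)

# Hypothetical latent values for four observations.
latent = np.array([-1.3, -0.2, 0.4, 2.1])

binary = threshold_latent(latent, np.array([0.0]))              # one cutpoint: 0/1 outcome
ordinal = threshold_latent(latent, np.array([-1.0, 0.0, 1.0]))  # three cutpoints: 4 levels
print(binary, ordinal)
```

In a model of this kind the latent values are unobserved and would be inferred from the observed categories; the sketch only shows the forward mapping from latent values to categories.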

Advances in Data Analysis and Classification, vol. 17, no. 2, pp. 369-406. Journal Article.
Citations: 3
Correction to: Principal component analysis constrained by layered simple structures
IF 1.6 | Zone 4, Computer Science | Q2 STATISTICS & PROBABILITY | Pub Date: 2022-06-24 | DOI: 10.1007/s11634-022-00506-6
Naoto Yamashita
Advances in Data Analysis and Classification, vol. 16, no. 4, pp. 1099-1100. Journal Article.
Citations: 1