
Latest publications from Advances in Data Analysis and Classification

Semiparametric finite mixture of regression models with Bayesian P-splines
IF 1.6 · CAS Tier 4, Computer Science · Q2 Statistics & Probability · Pub Date: 2022-10-18 · DOI: 10.1007/s11634-022-00523-5
Marco Berrettini, Giuliano Galimberti, Saverio Ranciati

Mixture models provide a useful tool to account for unobserved heterogeneity and are at the basis of many model-based clustering methods. To gain additional flexibility, some model parameters can be expressed as functions of concomitant covariates. In this paper, a semiparametric finite mixture of regression models is defined, with concomitant information assumed to influence both the component weights and the conditional means. In particular, linear predictors are replaced with smooth functions of the covariate considered, by resorting to cubic splines. An estimation procedure within the Bayesian paradigm is suggested, where smoothness of the covariate effects is controlled by suitable choices for the prior distributions of the spline coefficients. A data augmentation scheme based on difference random utility models is exploited to describe the mixture weights as functions of the covariate. The performance of the proposed methodology is investigated via simulation experiments and two real-world datasets, one about baseball salaries and the other concerning nitrogen oxide in engine exhaust.
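The model structure described above can be made concrete with a short simulation: the mixing weight and the two component means are all driven by a cubic B-spline basis of the concomitant covariate. The Python sketch below only illustrates that data-generating structure under arbitrary coefficients and noise levels; it is not the authors' Bayesian P-spline estimator, and the basis size and parameter values are made up for illustration.

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(0)

def bspline_design(x, n_basis=8, degree=3):
    """Cubic B-spline design matrix with equally spaced interior knots."""
    xl, xu = x.min() - 1e-6, x.max() + 1e-6           # pad so all x lie strictly inside
    interior = np.linspace(xl, xu, n_basis - degree + 1)[1:-1]
    knots = np.concatenate([[xl] * (degree + 1), interior, [xu] * (degree + 1)])
    # evaluate each basis function by giving it a unit coefficient vector
    return np.column_stack(
        [BSpline(knots, np.eye(n_basis)[j], degree)(x) for j in range(n_basis)]
    )

# Concomitant covariate and its spline basis
n = 500
x = np.sort(rng.uniform(0, 10, n))
B = bspline_design(x)                                  # n x n_basis

# Hypothetical spline coefficients: smooth component means and a smooth mixing weight
beta1 = rng.normal(0.0, 1.0, B.shape[1])               # coefficients for the mean of component 1
beta2 = rng.normal(3.0, 1.0, B.shape[1])               # coefficients for the mean of component 2
gamma = rng.normal(0.0, 0.5, B.shape[1])               # coefficients for the logit of the weight

mu1, mu2 = B @ beta1, B @ beta2                        # smooth conditional means
pi1 = 1.0 / (1.0 + np.exp(-(B @ gamma)))               # smooth component weight in (0, 1)

# Draw component labels and responses from the two-component mixture
z = rng.uniform(size=n) < pi1
y = np.where(z, rng.normal(mu1, 0.5), rng.normal(mu2, 0.5))
```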

{"title":"Semiparametric finite mixture of regression models with Bayesian P-splines","authors":"Marco Berrettini,&nbsp;Giuliano Galimberti,&nbsp;Saverio Ranciati","doi":"10.1007/s11634-022-00523-5","DOIUrl":"10.1007/s11634-022-00523-5","url":null,"abstract":"<div><p>Mixture models provide a useful tool to account for unobserved heterogeneity and are at the basis of many model-based clustering methods. To gain additional flexibility, some model parameters can be expressed as functions of concomitant covariates. In this Paper, a semiparametric finite mixture of regression models is defined, with concomitant information assumed to influence both the component weights and the conditional means. In particular, linear predictors are replaced with smooth functions of the covariate considered by resorting to cubic splines. An estimation procedure within the Bayesian paradigm is suggested, where smoothness of the covariate effects is controlled by suitable choices for the prior distributions of the spline coefficients. A data augmentation scheme based on difference random utility models is exploited to describe the mixture weights as functions of the covariate. The performance of the proposed methodology is investigated via simulation experiments and two real-world datasets, one about baseball salaries and the other concerning nitrogen oxide in engine exhaust.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"17 3","pages":"745 - 775"},"PeriodicalIF":1.6,"publicationDate":"2022-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s11634-022-00523-5.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50036456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
On smoothing and scaling language model for sentiment based information retrieval
IF 1.6 · CAS Tier 4, Computer Science · Q2 Statistics & Probability · Pub Date: 2022-10-13 · DOI: 10.1007/s11634-022-00522-6
Fatma Najar, Nizar Bouguila

Sentiment analysis or opinion mining refers to the discovery of sentiment information within textual documents, tweets, or review posts. The field has emerged with the growth of social media and is of great interest for several applications such as marketing, tourism, and business. In this work, we approach Twitter sentiment analysis through a novel framework that simultaneously addresses text-representation problems such as sparseness and high dimensionality. We propose an information retrieval probabilistic model based on a new distribution, namely the Smoothed Scaled Dirichlet distribution. We present a likelihood learning method for estimating the parameters of the distribution and we propose feature generation from the information retrieval system. We apply the proposed approach, the Smoothed Scaled Relevance Model, to four Twitter sentiment datasets: STD, STS-Gold, SemEval14, and SentiStrength. We evaluate the performance of the offered solution against the baseline models and related works.
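The core modelling object here is the Scaled Dirichlet family. As a point of reference, and with the caveat that the paper's smoothed variant adds a smoothing step for sparse, high-dimensional text that is not reproduced here, a minimal sketch of the plain scaled Dirichlet log-density with hypothetical parameter values might look as follows.

```python
import numpy as np
from scipy.special import gammaln

def scaled_dirichlet_logpdf(x, alpha, beta):
    """Log-density of the (plain) scaled Dirichlet distribution for a point x on the simplex.

    Illustration only: the paper's Smoothed Scaled Dirichlet modifies this baseline
    to cope with sparse, high-dimensional text representations.
    """
    x, alpha, beta = map(np.asarray, (x, alpha, beta))
    a_plus = alpha.sum()
    log_norm = gammaln(a_plus) - gammaln(alpha).sum()
    return (log_norm
            + np.sum(alpha * np.log(beta) + (alpha - 1.0) * np.log(x))
            - a_plus * np.log(np.dot(beta, x)))

# Example: a document represented as a 4-term proportion vector
x = np.array([0.4, 0.3, 0.2, 0.1])
alpha = np.array([2.0, 1.5, 1.0, 0.5])   # shape parameters (hypothetical values)
beta = np.array([1.0, 0.8, 1.2, 1.0])    # scale parameters (hypothetical values)
print(scaled_dirichlet_logpdf(x, alpha, beta))
```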

{"title":"On smoothing and scaling language model for sentiment based information retrieval","authors":"Fatma Najar,&nbsp;Nizar Bouguila","doi":"10.1007/s11634-022-00522-6","DOIUrl":"10.1007/s11634-022-00522-6","url":null,"abstract":"<div><p>Sentiment analysis or opinion mining refers to the discovery of sentiment information within textual documents, tweets, or review posts. This field has emerged with the social media outgrowth which becomes of great interest for several applications such as marketing, tourism, and business. In this work, we approach Twitter sentiment analysis through a novel framework that addresses simultaneously the problems of text representation such as sparseness and high-dimensionality. We propose an information retrieval probabilistic model based on a new distribution namely the Smoothed Scaled Dirichlet distribution. We present a likelihood learning method for estimating the parameters of the distribution and we propose a feature generation from the information retrieval system. We apply the proposed approach Smoothed Scaled Relevance Model on four Twitter sentiment datasets: STD, STS-Gold, SemEval14, and SentiStrength. We evaluate the performance of the offered solution with a comparison against the baseline models and the related-works.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"17 3","pages":"725 - 744"},"PeriodicalIF":1.6,"publicationDate":"2022-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50024344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
The role of diversity and ensemble learning in credit card fraud detection
IF 1.4 · CAS Tier 4, Computer Science · Q2 Statistics & Probability · Pub Date: 2022-09-28 · DOI: 10.1007/s11634-022-00515-5
Gian Marco Paldino, Bertrand Lebichot, Yann-Aël Le Borgne, Wissam Siblini, Frédéric Oblé, Giacomo Boracchi, Gianluca Bontempi

The number of daily credit card transactions is inexorably growing: the expansion of the e-commerce market and the recent constraints due to the Covid-19 pandemic have significantly increased the use of electronic payments. The ability to precisely detect fraudulent transactions is increasingly important, and machine learning models are now a key component of the detection process. Standard machine learning techniques are widely employed, but they are inadequate for the evolving nature of customer behavior, which entails continuous changes in the underlying data distribution. This problem is often tackled by discarding past knowledge, despite its potential relevance in the case of recurrent concepts. Appropriate exploitation of historical knowledge is necessary: we propose a learning strategy that relies on diversity-based ensemble learning and allows past concepts to be preserved and reused for a faster adaptation to changes. In our experiments, we adopt several state-of-the-art diversity measures and perform comparisons with various other learning approaches. We assess the effectiveness of the proposed learning strategy on extracts of two real datasets from two European countries, containing more than 30 M and 50 M transactions, provided by our industrial partner, Worldline, a leading company in the field.
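The strategy rests on keeping a pool of models trained on past data and selecting a diverse subset of them. The sketch below, written in Python with scikit-learn, illustrates one generic way to do this: train one classifier per historical chunk, measure pairwise disagreement on a validation set, greedily keep a diverse subset, and average its fraud scores. The chunking, base learner, and greedy selection rule are illustrative assumptions, not the authors' exact procedure or their diversity measures.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def disagreement(pred_a, pred_b):
    """Pairwise disagreement diversity: fraction of points the two members label differently."""
    return np.mean(pred_a != pred_b)

def build_diverse_ensemble(chunks, X_val, y_val, max_members=5):
    """Train one model per historical data chunk and greedily keep a diverse subset.

    `chunks` is a list of (X, y) arrays, e.g. transactions grouped by day or week.
    """
    pool = [LogisticRegression(max_iter=1000).fit(X, y) for X, y in chunks]
    preds = [m.predict(X_val) for m in pool]

    # Greedy selection: start from the most accurate member, then repeatedly add
    # the member that is, on average, most dissimilar to those already selected.
    accuracy = [np.mean(p == y_val) for p in preds]
    selected = [int(np.argmax(accuracy))]
    while len(selected) < min(max_members, len(pool)):
        remaining = [i for i in range(len(pool)) if i not in selected]
        scores = [np.mean([disagreement(preds[i], preds[j]) for j in selected])
                  for i in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return [pool[i] for i in selected]

def ensemble_fraud_score(ensemble, X):
    """Average the members' predicted fraud probabilities (positive class assumed to be 1)."""
    return np.mean([m.predict_proba(X)[:, 1] for m in ensemble], axis=0)
```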

{"title":"The role of diversity and ensemble learning in credit card fraud detection","authors":"Gian Marco Paldino,&nbsp;Bertrand Lebichot,&nbsp;Yann-Aël Le Borgne,&nbsp;Wissam Siblini,&nbsp;Frédéric Oblé,&nbsp;Giacomo Boracchi,&nbsp;Gianluca Bontempi","doi":"10.1007/s11634-022-00515-5","DOIUrl":"10.1007/s11634-022-00515-5","url":null,"abstract":"<div><p>The number of daily credit card transactions is inexorably growing: the e-commerce market expansion and the recent constraints for the Covid-19 pandemic have significantly increased the use of electronic payments. The ability to precisely detect fraudulent transactions is increasingly important, and machine learning models are now a key component of the detection process. Standard machine learning techniques are widely employed, but inadequate for the evolving nature of customers behavior entailing continuous changes in the underlying data distribution. his problem is often tackled by discarding past knowledge, despite its potential relevance in the case of recurrent concepts. Appropriate exploitation of historical knowledge is necessary: we propose a learning strategy that relies on diversity-based ensemble learning and allows to preserve past concepts and reuse them for a faster adaptation to changes. In our experiments, we adopt several state-of-the-art diversity measures and we perform comparisons with various other learning approaches. We assess the effectiveness of our proposed learning strategy on extracts of two real datasets from two European countries, containing more than 30 M and 50 M transactions, provided by our industrial partner, Worldline, a leading company in the field.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"18 1","pages":"193 - 217"},"PeriodicalIF":1.4,"publicationDate":"2022-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40392926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Benchmarking distance-based partitioning methods for mixed-type data
IF 1.6 · CAS Tier 4, Computer Science · Q2 Statistics & Probability · Pub Date: 2022-09-22 · DOI: 10.1007/s11634-022-00521-7
Efthymios Costa, Ioanna Papatsouma, Angelos Markos

Clustering mixed-type data, that is, observation-by-variable data consisting of both continuous and categorical variables, poses novel challenges. Foremost among these challenges is the choice of the most appropriate clustering method for the data. This paper presents a benchmarking study comparing eight distance-based partitioning methods for mixed-type data in terms of cluster recovery performance. A series of simulations carried out under a full factorial design is presented that examines the effect of a variety of factors on cluster recovery. The amount of cluster overlap, the percentage of categorical variables in the data set, the number of clusters and the number of observations had the largest effects on cluster recovery in most of the tested scenarios. KAMILA, K-Prototypes and sequential Factor Analysis and K-Means clustering typically performed better than the other methods. The study can serve as a useful reference for practitioners when choosing the most appropriate method.
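As a minimal, self-contained example of distance-based partitioning for mixed-type data (not one of the eight benchmarked methods themselves), the following sketch builds a Gower-style distance matrix, applies average-linkage clustering on it, and scores cluster recovery with the adjusted Rand index, one common recovery measure. Data sizes and group separation are arbitrary.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.metrics import adjusted_rand_score

def gower_distance(X_num, X_cat):
    """Gower-style distance matrix for mixed data: range-scaled absolute differences on the
    numeric block plus simple matching on the categorical block, averaged over variables."""
    n = X_num.shape[0]
    ranges = X_num.max(axis=0) - X_num.min(axis=0)
    ranges[ranges == 0] = 1.0
    D = np.zeros((n, n))
    for i in range(n):
        d_num = np.abs(X_num - X_num[i]) / ranges      # each numeric distance in [0, 1]
        d_cat = (X_cat != X_cat[i]).astype(float)      # 0/1 mismatches
        D[i] = np.hstack([d_num, d_cat]).mean(axis=1)
    return D

# Toy mixed data with two groups (labels only used to score recovery)
rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 50)
X_num = rng.normal(loc=labels[:, None] * 2.0, scale=1.0, size=(100, 2))
X_cat = rng.choice(["a", "b"], size=(100, 1))

# Distance-based partitioning: average-linkage clustering on the Gower distances,
# a simple stand-in for the benchmarked methods (e.g. PAM, KAMILA, K-Prototypes).
D = gower_distance(X_num, X_cat)
clusters = fcluster(linkage(squareform(D, checks=False), method="average"),
                    t=2, criterion="maxclust")
print("ARI:", adjusted_rand_score(labels, clusters))
```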

{"title":"Benchmarking distance-based partitioning methods for mixed-type data","authors":"Efthymios Costa,&nbsp;Ioanna Papatsouma,&nbsp;Angelos Markos","doi":"10.1007/s11634-022-00521-7","DOIUrl":"10.1007/s11634-022-00521-7","url":null,"abstract":"<div><p>Clustering mixed-type data, that is, observation by variable data that consist of both continuous and categorical variables poses novel challenges. Foremost among these challenges is the choice of the most appropriate clustering method for the data. This paper presents a benchmarking study comparing eight distance-based partitioning methods for mixed-type data in terms of cluster recovery performance. A series of simulations carried out by a full factorial design are presented that examined the effect of a variety of factors on cluster recovery. The amount of cluster overlap, the percentage of categorical variables in the data set, the number of clusters and the number of observations had the largest effects on cluster recovery and in most of the tested scenarios. KAMILA, K-Prototypes and sequential Factor Analysis and K-Means clustering typically performed better than other methods. The study can be a useful reference for practitioners in the choice of the most appropriate method.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"17 3","pages":"701 - 724"},"PeriodicalIF":1.6,"publicationDate":"2022-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s11634-022-00521-7.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50506372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
New models for symbolic data analysis
IF 1.6 · CAS Tier 4, Computer Science · Q2 Statistics & Probability · Pub Date: 2022-09-19 · DOI: 10.1007/s11634-022-00520-8
Boris Beranger, Huan Lin, Scott Sisson

Symbolic data analysis (SDA) is an emerging area of statistics concerned with understanding and modelling data that takes distributional form (i.e. symbols), such as random lists, intervals and histograms. It was developed under the premise that the statistical unit of interest is the symbol, and that inference is required at this level. Here we consider a different perspective, which opens a new research direction in the field of SDA. We assume that, as with a standard statistical analysis, inference is required at the level of individual-level data. However, the individual-level data are unobserved, and are aggregated into observed symbols—group-based distributional-valued summaries—prior to the analysis. We introduce a novel general method for constructing likelihood functions for symbolic data based on a desired probability model for the underlying measurement-level data, while only observing the distributional summaries. This approach opens the door for new classes of symbol design and construction, in addition to developing SDA as a viable tool to enable and improve upon classical data analyses, particularly for very large and complex datasets. We illustrate this new direction for SDA research through several real and simulated data analyses, including a study of novel classes of multivariate symbol construction techniques.
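To make the idea of building a symbol-level likelihood from a model for the unobserved individual-level data concrete, here is a small sketch for one specific case: a histogram-valued symbol whose underlying measurements are assumed Gaussian, so that the bin counts are multinomial with probabilities given by the Gaussian CDF. This only illustrates the general construction, not the paper's method or its symbol classes; names and values are made up.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def histogram_symbol_negloglik(params, bin_edges, counts):
    """Negative log-likelihood (up to an additive constant) of a histogram-valued symbol
    under an assumed Gaussian model for the unobserved individual-level data: the bin
    counts are multinomial with probabilities given by differences of the Gaussian CDF."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    cdf = norm.cdf(bin_edges, loc=mu, scale=sigma)
    bin_probs = np.clip(np.diff(cdf), 1e-12, None)
    return -np.sum(counts * np.log(bin_probs))

# Simulate individual-level data, then aggregate it into one histogram symbol
rng = np.random.default_rng(2)
data = rng.normal(loc=1.5, scale=2.0, size=10_000)
bin_edges = np.linspace(-8, 12, 21)
counts, _ = np.histogram(data, bins=bin_edges)

# Fit the underlying Gaussian using only the symbol (the bin counts)
fit = minimize(histogram_symbol_negloglik, x0=np.array([0.0, 0.0]),
               args=(bin_edges, counts), method="Nelder-Mead")
mu_hat, sigma_hat = fit.x[0], np.exp(fit.x[1])
print(mu_hat, sigma_hat)
```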

{"title":"New models for symbolic data analysis","authors":"Boris Beranger,&nbsp;Huan Lin,&nbsp;Scott Sisson","doi":"10.1007/s11634-022-00520-8","DOIUrl":"10.1007/s11634-022-00520-8","url":null,"abstract":"<div><p>Symbolic data analysis (SDA) is an emerging area of statistics concerned with understanding and modelling data that takes distributional form (i.e. <i>symbols</i>), such as random lists, intervals and histograms. It was developed under the premise that the statistical unit of interest is the symbol, and that inference is required at this level. Here we consider a different perspective, which opens a new research direction in the field of SDA. We assume that, as with a standard statistical analysis, inference is required at the level of individual-level data. However, the individual-level data are unobserved, and are aggregated into observed symbols—group-based distributional-valued summaries—prior to the analysis. We introduce a novel general method for constructing likelihood functions for symbolic data based on a desired probability model for the underlying measurement-level data, while only observing the distributional summaries. This approach opens the door for new classes of symbol design and construction, in addition to developing SDA as a viable tool to enable and improve upon classical data analyses, particularly for very large and complex datasets. We illustrate this new direction for SDA research through several real and simulated data analyses, including a study of novel classes of multivariate symbol construction techniques.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"17 3","pages":"659 - 699"},"PeriodicalIF":1.6,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s11634-022-00520-8.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50038965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
Slice weighted average regression
IF 1.6 · CAS Tier 4, Computer Science · Q2 Statistics & Probability · Pub Date: 2022-09-10 · DOI: 10.1007/s11634-023-00551-9
Marina Masioti, Joshua J. Davies, Amanda Shaker, L. Prendergast
{"title":"Slice weighted average regression","authors":"Marina Masioti, Joshua J. Davies, Amanda Shaker, L. Prendergast","doi":"10.1007/s11634-023-00551-9","DOIUrl":"https://doi.org/10.1007/s11634-023-00551-9","url":null,"abstract":"","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"220 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2022-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89127621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Robust regression for interval-valued data based on midpoints and log-ranges
IF 1.6 · CAS Tier 4, Computer Science · Q2 Statistics & Probability · Pub Date: 2022-09-05 · DOI: 10.1007/s11634-022-00518-2
Qing Zhao, Huiwen Wang, Shanshan Wang

Flexible modelling of interval-valued data is of great practical importance given the development of advanced technologies in current data collection processes. This paper proposes a new robust regression model for interval-valued data based on the midpoints and log-ranges of the dependent intervals, and obtains the parameter estimators using the Huber loss function to deal with possible outliers in a data set. Besides, the logarithm transformation avoids the non-negativity constraints required by traditional modelling of ranges, which is beneficial to the flexible use of various regression methods. We conduct extensive Monte Carlo simulation experiments to compare the finite-sample performance of our model with that of existing regression methods for interval-valued data. Results indicate that the proposed method shows competitive performance, especially on data sets containing outliers and in scenarios where both the midpoints and ranges of the independent variables are related to those of the dependent one. Moreover, two empirical interval-valued data sets are used to illustrate the effectiveness of our method.
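A rough sketch of the midpoint/log-range idea with a robust loss is given below, using scikit-learn's HuberRegressor as the robust fitter: the interval response is split into a midpoint and a log-range, each is regressed on the predictor's midpoint and log-range, and predictions are back-transformed into interval bounds. This is an illustration under simplifying assumptions (a single interval-valued predictor, off-the-shelf Huber regression), not the paper's estimator.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor

def fit_interval_regression(X_lower, X_upper, y_lower, y_upper):
    """Fit two Huber regressions for an interval-valued response: one for the midpoints
    and one for the log-ranges, using the predictor's midpoint and log-range as features."""
    def mid_logrange(lower, upper):
        return (lower + upper) / 2.0, np.log(upper - lower)

    X_mid, X_logr = mid_logrange(X_lower, X_upper)
    y_mid, y_logr = mid_logrange(y_lower, y_upper)
    features = np.column_stack([X_mid, X_logr])

    model_mid = HuberRegressor().fit(features, y_mid)    # robust fit of the midpoints
    model_logr = HuberRegressor().fit(features, y_logr)  # robust fit of the log-ranges
    return model_mid, model_logr

def predict_intervals(model_mid, model_logr, X_lower, X_upper):
    """Back-transform predicted midpoints and log-ranges into interval bounds."""
    X_mid = (X_lower + X_upper) / 2.0
    X_logr = np.log(X_upper - X_lower)
    features = np.column_stack([X_mid, X_logr])
    mid = model_mid.predict(features)
    half_range = np.exp(model_logr.predict(features)) / 2.0
    return mid - half_range, mid + half_range
```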

{"title":"Robust regression for interval-valued data based on midpoints and log-ranges","authors":"Qing Zhao,&nbsp;Huiwen Wang,&nbsp;Shanshan Wang","doi":"10.1007/s11634-022-00518-2","DOIUrl":"10.1007/s11634-022-00518-2","url":null,"abstract":"<div><p>Flexible modelling of interval-valued data is of great practical importance with the development of advanced technologies in current data collection processes. This paper proposes a new robust regression model for interval-valued data based on midpoints and log-ranges of the dependent intervals, and obtains the parameter estimators using Huber loss function to deal with possible outliers in a data set. Besides, the use of logarithm transformation avoids the non-negativity constraints for the traditional modelling of ranges, which is beneficial to the flexible use of various regression methods. We conduct extensive Monte Carlo simulation experiments to compare the finite-sample performance of our model with that of the existing regression methods for interval-valued data. Results indicate that the proposed method shows competitive performance, especially in the data set with the existence of outliers and the scenarios where both midpoints and ranges of independent variables are related to those of the dependent one. Moreover, two empirical interval-valued data sets are applied to illustrate the effectiveness of our method.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"17 3","pages":"583 - 621"},"PeriodicalIF":1.6,"publicationDate":"2022-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s11634-022-00518-2.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50010514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Band depth based initialization of K-means for functional data clustering
IF 1.6 · CAS Tier 4, Computer Science · Q2 Statistics & Probability · Pub Date: 2022-09-03 · DOI: 10.1007/s11634-022-00510-w
Javier Albert-Smet, Aurora Torrente, Juan Romo

The k-Means algorithm is one of the most popular choices for clustering data but is well known to be sensitive to the initialization process. There is a substantial number of methods that aim at finding optimal initial seeds for k-Means, though none of them is universally valid. This paper presents an extension to longitudinal data of one such method, the BRIk algorithm, which relies on clustering a set of centroids derived from bootstrap replicates of the data and on the use of the versatile Modified Band Depth. In our approach we improve the BRIk method by adding a step in which we fit appropriate B-splines to our observations, and a resampling process that ensures computational feasibility and handles issues such as noise or missing data. We have derived two techniques for providing suitable initial seeds, each of them stressing respectively the multivariate or the functional nature of the data. Our results with simulated and real data sets indicate that our Functional Data Approach to the BRIk method (FABRIk) and our Functional Data Extension of the BRIk method (FDEBRIk) are more effective than previous proposals at providing seeds to initialize k-Means in terms of clustering recovery.
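The ingredients of the initialization strategy, bootstrap replicates, clustering of the resulting centroids, and the Modified Band Depth, can be sketched in a few lines of Python. The code below is loosely in the spirit of BRIk only: it omits the B-spline smoothing step of FABRIk/FDEBRIk, uses plain k-means on the raw curves, and picks the deepest centroid of each centroid group as a seed; all tuning constants are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

def modified_band_depth(curves):
    """Modified Band Depth of each row of `curves` (n_curves x n_points), via the usual
    rank-based formula: at every time point, count the pairs of curves whose band
    contains the curve, then average over time points and normalise by the number of pairs."""
    n, _ = curves.shape
    ranks = curves.argsort(axis=0).argsort(axis=0) + 1   # per-time-point ranks, 1..n
    below, above = ranks - 1, n - ranks
    pairs_containing = below * above + (n - 1)
    return pairs_containing.mean(axis=1) / (n * (n - 1) / 2)

def brik_like_seeds(curves, k, n_boot=50, random_state=0):
    """Bootstrap-based initial seeds for k-Means on functional data: collect k-means
    centroids from bootstrap replicates, group them into k clusters, and return the
    deepest centroid (by MBD) of each group. A sketch only."""
    rng = np.random.default_rng(random_state)
    n = curves.shape[0]
    centroids = []
    for _ in range(n_boot):
        sample = curves[rng.integers(0, n, size=n)]
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(sample)
        centroids.append(km.cluster_centers_)
    centroids = np.vstack(centroids)                     # (n_boot * k, n_points)

    groups = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(centroids)
    seeds = []
    for g in range(k):
        members = centroids[groups == g]
        if len(members) == 1:
            seeds.append(members[0])
        else:
            seeds.append(members[np.argmax(modified_band_depth(members))])
    return np.vstack(seeds)

# Usage: KMeans(n_clusters=k, init=brik_like_seeds(curves, k), n_init=1).fit(curves)
```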

{"title":"Band depth based initialization of K-means for functional data clustering","authors":"Javier Albert-Smet,&nbsp;Aurora Torrente,&nbsp;Juan Romo","doi":"10.1007/s11634-022-00510-w","DOIUrl":"10.1007/s11634-022-00510-w","url":null,"abstract":"<div><p>The <i>k</i>-Means algorithm is one of the most popular choices for clustering data but is well-known to be sensitive to the initialization process. There is a substantial number of methods that aim at finding optimal initial seeds for <i>k</i>-Means, though none of them is universally valid. This paper presents an extension to longitudinal data of one of such methods, the BRIk algorithm, that relies on clustering a set of centroids derived from bootstrap replicates of the data and on the use of the versatile Modified Band Depth. In our approach we improve the BRIk method by adding a step where we fit appropriate B-splines to our observations and a resampling process that allows computational feasibility and handling issues such as noise or missing data. We have derived two techniques for providing suitable initial seeds, each of them stressing respectively the multivariate or the functional nature of the data. Our results with simulated and real data sets indicate that our <i>F</i>unctional Data <i>A</i>pproach to the BRIK method (FABRIk) and our <i>F</i>unctional <i>D</i>ata <i>E</i>xtension of the BRIK method (FDEBRIk) are more effective than previous proposals at providing seeds to initialize <i>k</i>-Means in terms of clustering recovery.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"17 2","pages":"463 - 484"},"PeriodicalIF":1.6,"publicationDate":"2022-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s11634-022-00510-w.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50447089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Nonparametric regression and classification with functional, categorical, and mixed covariates
IF 1.6 · CAS Tier 4, Computer Science · Q2 Statistics & Probability · Pub Date: 2022-09-02 · DOI: 10.1007/s11634-022-00513-7
Leonie Selk, Jan Gertheiss

We consider nonparametric prediction with multiple covariates, in particular categorical or functional predictors, or a mixture of both. The proposed method is based on an extension of the Nadaraya-Watson estimator in which a kernel function is applied to a linear combination of distance measures, each calculated on a single covariate, with the weights being estimated from the training data. The dependent variable can be categorical (binary or multi-class) or continuous, so we consider both classification and regression problems. The methodology presented is illustrated and evaluated on artificial and real-world data. In particular, it is observed that prediction accuracy can be increased, and irrelevant noise variables can be identified and removed, by 'downgrading' the corresponding distance measures in a completely data-driven way.
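A generic version of such an estimator is easy to write down: compute one distance per covariate (absolute difference for continuous, 0/1 mismatch for categorical), combine them with weights, and apply a kernel to the combined distance. The sketch below uses fixed, user-supplied weights; in the paper the weights are estimated from the training data, which is not shown here, and the kernel choice and toy data are arbitrary.

```python
import numpy as np

def mixed_kernel_predict(X_num, X_cat, y, X_num_new, X_cat_new,
                         weights_num, weights_cat, bandwidth=1.0):
    """Nadaraya-Watson style prediction with a kernel applied to a weighted sum of
    per-covariate distances: absolute differences for continuous covariates and
    0/1 mismatches for categorical ones."""
    preds = np.empty(len(X_num_new))
    for i in range(len(X_num_new)):
        d_num = np.abs(X_num - X_num_new[i]) @ weights_num           # (n,) continuous part
        d_cat = (X_cat != X_cat_new[i]).astype(float) @ weights_cat  # (n,) categorical part
        k = np.exp(-((d_num + d_cat) / bandwidth) ** 2)              # Gaussian-type kernel
        preds[i] = np.sum(k * y) / np.sum(k)                         # weighted local average
    return preds

# Toy example with one continuous and one categorical covariate
rng = np.random.default_rng(3)
X_num = rng.normal(size=(200, 1))
X_cat = rng.choice(["a", "b"], size=(200, 1))
y = 2.0 * X_num[:, 0] + (X_cat[:, 0] == "a") + rng.normal(scale=0.3, size=200)

pred = mixed_kernel_predict(X_num, X_cat, y, X_num[:5], X_cat[:5],
                            weights_num=np.array([1.0]), weights_cat=np.array([0.5]))
print(pred)
```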

{"title":"Nonparametric regression and classification with functional, categorical, and mixed covariates","authors":"Leonie Selk,&nbsp;Jan Gertheiss","doi":"10.1007/s11634-022-00513-7","DOIUrl":"10.1007/s11634-022-00513-7","url":null,"abstract":"<div><p>We consider nonparametric prediction with multiple covariates, in particular categorical or functional predictors, or a mixture of both. The method proposed bases on an extension of the Nadaraya-Watson estimator where a kernel function is applied on a linear combination of distance measures each calculated on single covariates, with weights being estimated from the training data. The dependent variable can be categorical (binary or multi-class) or continuous, thus we consider both classification and regression problems. The methodology presented is illustrated and evaluated on artificial and real world data. Particularly it is observed that prediction accuracy can be increased, and irrelevant, noise variables can be identified/removed by ‘downgrading’ the corresponding distance measures in a completely data-driven way.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"17 2","pages":"519 - 543"},"PeriodicalIF":1.6,"publicationDate":"2022-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s11634-022-00513-7.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50442918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Clustering with missing data: which equivalent for Rubin's rules?
IF 1.6 · CAS Tier 4, Computer Science · Q2 Statistics & Probability · Pub Date: 2022-09-01 · DOI: 10.1007/s11634-022-00519-1
Vincent Audigier, Ndèye Niang

Multiple imputation (MI) is a popular method for dealing with missing values. However, the suitable way to apply clustering after MI remains unclear: how should partitions be pooled? How should clustering instability be assessed when data are incomplete? By answering both questions, this paper proposes a complete view of clustering with missing data using MI. The problem of pooling partitions is addressed using consensus clustering while, based on the bootstrap theory, we explain how to assess the instability related to observed and missing data. The new rules for pooling partitions and assessing instability are theoretically argued and extensively studied by simulation. Pooling partitions improves accuracy, while measuring instability with missing data enlarges the data analysis possibilities: it allows assessment of the dependence of the clustering on the imputation model, as well as a convenient way of choosing the number of clusters when data are incomplete, as illustrated on a real data set.
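One concrete way to pool partitions across imputed datasets is consensus clustering on a co-association matrix, as sketched below with scikit-learn's IterativeImputer standing in for the multiple-imputation step: each imputed dataset is clustered, co-clustering frequencies are accumulated, and a final partition is obtained by hierarchical clustering on one minus the co-association matrix. This illustrates the pooling idea only; the paper's specific pooling rules and its bootstrap-based instability measures are not reproduced.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.cluster import KMeans

def cluster_with_multiple_imputation(X_missing, n_clusters=3, n_imputations=10, random_state=0):
    """Cluster incomplete data by (i) generating several imputed datasets, (ii) clustering
    each one, and (iii) pooling the partitions by consensus clustering on the
    co-association matrix."""
    n = X_missing.shape[0]
    coassoc = np.zeros((n, n))
    for m in range(n_imputations):
        imputer = IterativeImputer(sample_posterior=True, random_state=random_state + m)
        X_imp = imputer.fit_transform(X_missing)
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=random_state).fit_predict(X_imp)
        coassoc += (labels[:, None] == labels[None, :]).astype(float)
    coassoc /= n_imputations

    # Consensus partition: average-linkage clustering on 1 - co-association (a dissimilarity)
    dist = 1.0 - coassoc
    consensus = fcluster(linkage(squareform(dist, checks=False), method="average"),
                         t=n_clusters, criterion="maxclust")
    return consensus, coassoc
```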

{"title":"Clustering with missing data: which equivalent for Rubin’s rules?","authors":"Vincent Audigier,&nbsp;Ndèye Niang","doi":"10.1007/s11634-022-00519-1","DOIUrl":"10.1007/s11634-022-00519-1","url":null,"abstract":"<div><p>Multiple imputation (MI) is a popular method for dealing with missing values. However, the suitable way for applying clustering after MI remains unclear: how to pool partitions? How to assess the clustering instability when data are incomplete? By answering both questions, this paper proposed a complete view of clustering with missing data using MI. The problem of partitions pooling is here addressed using consensus clustering while, based on the bootstrap theory, we explain how to assess the instability related to observed and missing data. The new rules for pooling partitions and instability assessment are theoretically argued and extensively studied by simulation. Partitions pooling improves accuracy, while measuring instability with missing data enlarges the data analysis possibilities: it allows assessment of the dependence of the clustering to the imputation model, as well as a convenient way for choosing the number of clusters when data are incomplete, as illustrated on a real data set.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"17 3","pages":"623 - 657"},"PeriodicalIF":1.6,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50001501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6