Advances in Data Analysis and Classification: Latest Articles

Model-based clustering of functional data via mixtures of t distributions
IF 1.4 | Zone 4, Computer Science | Q2 Statistics & Probability | Pub Date: 2023-05-12 | DOI: 10.1007/s11634-023-00542-w
Cristina Anton, Iain Smith

We propose a procedure, called T-funHDDC, for clustering multivariate functional data with outliers which extends the functional high dimensional data clustering (funHDDC) method (Schmutz et al. in Comput Stat 35:1101–1131, 2020) by considering a mixture of multivariate t distributions. We define a family of latent mixture models following the approach used for the parsimonious models considered in funHDDC and also constraining or not the degrees of freedom of the multivariate t distributions to be equal across the mixture components. The parameters of these models are estimated using an expectation maximization algorithm. In addition to proposing the T-funHDDC method, we add a family of parsimonious models to C-funHDDC, which is an alternative method for clustering multivariate functional data with outliers based on a mixture of contaminated normal distributions (Amovin-Assagba et al. in Comput Stat Data Anal 174:107496, 2022). We compare T-funHDDC, C-funHDDC, and other existing methods on simulated functional data with outliers and for real-world data. T-funHDDC outperforms funHDDC when applied to functional data with outliers, and its good performance makes it an alternative to C-funHDDC. We also apply the T-funHDDC method to the analysis of traffic flow in Edmonton, Canada.
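The robustness of a t mixture to outliers can be illustrated with the standard E-step weight for a multivariate t component, u_i = (ν + p) / (ν + d_i), where d_i is the squared Mahalanobis distance of observation i. The sketch below is only an illustration on plain vectors: it ignores the functional basis expansion and the parsimonious subspace structure of T-funHDDC, and the function name `t_weights` and the toy data are assumptions.

```python
import numpy as np

def t_weights(X, mu, Sigma, nu):
    """E-step weights u_i = (nu + p) / (nu + d_i) for one multivariate t component."""
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    diff = X - mu
    # Squared Mahalanobis distance of every row of X to the component centre.
    d = np.einsum("ij,ij->i", diff @ np.linalg.inv(Sigma), diff)
    return (nu + p) / (nu + d)

# Hypothetical toy data: two central points and one clear outlier.
X = np.array([[0.0, 0.0], [0.5, -0.5], [8.0, 8.0]])
w = t_weights(X, mu=np.zeros(2), Sigma=np.eye(2), nu=4.0)
print(w)
```

Observations far from the centre receive small weights, so they contribute little to the updated mean and scale matrix in the M-step; this is the mechanism that makes a t mixture less sensitive to outliers than a Gaussian one.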

Advances in Data Analysis and Classification, vol. 18, no. 3, pp. 563–595.
Citations: 0
Finite mixture of hidden Markov models for tensor-variate time series data
IF 1.4 | Zone 4, Computer Science | Q2 Statistics & Probability | Pub Date: 2023-04-29 | DOI: 10.1007/s11634-023-00540-y
Abdullah Asilkalkan, Xuwen Zhu, Shuchismita Sarkar

The need to model data with higher dimensions, such as a tensor-variate framework where each observation is considered a three-dimensional object, is increasing due to rapid improvements in computational power and data storage capabilities. In this study, a finite mixture of hidden Markov models for tensor-variate time series data is developed. Simulation studies demonstrate high classification accuracy for both cluster and regime IDs. To further validate the usefulness of the proposed model, it is applied to real-life data with promising results.

Advances in Data Analysis and Classification, vol. 18, no. 3, pp. 545–562.
Citations: 0
Application of distance standard deviation in functional data analysis
IF 1.4 | Zone 4, Computer Science | Q2 Statistics & Probability | Pub Date: 2023-04-21 | DOI: 10.1007/s11634-023-00538-6
Mirosław Krzyśko, Łukasz Smaga

This paper concerns the measurement and testing of equality of variability of functional data. We apply the distance standard deviation constructed based on distance correlation, which was recently introduced as a measure of spread. For functional data, the distance standard deviation seems to measure different kinds of variability, not only scale differences. Moreover, the distance standard deviation is just one real number, and for this reason, it is of more practical value than the covariance function, which is a more difficult object to interpret. For testing equality of variability in two groups, we propose a permutation method based on centered observations, which controls the type I error level much better than the standard permutation method. We also consider the applicability of other correlations to measure the variability of functional data. The finite sample properties of two-sample tests are investigated in extensive simulation studies. We also illustrate their use in five real data examples based on various data sets.
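As a rough illustration of the two ideas in this abstract, the sketch below computes the sample distance standard deviation (the square root of the sample distance variance, obtained from the double-centred distance matrix) and runs a permutation test in which each group is centred by its own mean before pooling. Curves are assumed to be stored as rows sampled on a common grid, the Euclidean distance stands in for the L2 distance, and the test statistic (absolute difference of the two distance standard deviations) is an assumption, not necessarily the statistic used in the paper.

```python
import numpy as np

def distance_sd(X):
    """Sample distance standard deviation of the rows of X."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    # Pairwise Euclidean distances between observations (rows).
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    # Double centring: subtract row and column means, add the grand mean.
    A = D - D.mean(axis=0, keepdims=True) - D.mean(axis=1, keepdims=True) + D.mean()
    return np.sqrt((A ** 2).mean())

def centered_permutation_test(X, Y, n_perm=499, seed=0):
    """Permutation p-value for equal variability, permuting centred observations."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)  # remove each group's own mean function
    Yc = Y - Y.mean(axis=0)
    observed = abs(distance_sd(Xc) - distance_sd(Yc))
    pooled = np.vstack([Xc, Yc])
    n = len(Xc)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        stat = abs(distance_sd(pooled[idx[:n]]) - distance_sd(pooled[idx[n:]]))
        count += stat >= observed
    return (count + 1) / (n_perm + 1)

# Hypothetical data: 30 curves per group on a 20-point grid, different scale and mean.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 20))
Y = 3.0 * rng.normal(size=(30, 20)) + 5.0
p_value = centered_permutation_test(X, Y, n_perm=199)
print(distance_sd(X), distance_sd(Y), p_value)
```

Centring before pooling matters because a difference in group means would otherwise inflate the variability of the permuted groups, which is one plausible reason the centred variant controls the type I error better.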

Advances in Data Analysis and Classification, vol. 18, no. 2, pp. 431–454. PDF: https://link.springer.com/content/pdf/10.1007/s11634-023-00538-6.pdf
Citations: 0
An enhanced version of the SSA-HJ-biplot for time series with complex structure
IF 1.4 | Zone 4, Computer Science | Q2 Statistics & Probability | Pub Date: 2023-04-18 | DOI: 10.1007/s11634-023-00541-x
Alberto Silva, Adelaide Freitas

HJ-biplots can be used with singular spectral analysis to visualize and identify patterns in univariate time series. Named SSA-HJ-biplots, these graphs guarantee the simultaneous representation of the trajectory matrix’s rows and columns with maximum quality in the same factorial axes system and allow visualization of the separation of the time series components. Structural changes in the time series can make it challenging to visualize the components’ separation and lead to erroneous conclusions. This paper discusses an improved version of the SSA-HJ-biplot capable of handling this type of complexity. After separating the series’ signal and identifying points where structural changes occurred using multivariate techniques, the SSA-HJ-biplot is applied separately to the series’ homogeneous intervals, which is why some improvement in the visualization of the components’ separation is intended.
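The objects mentioned above can be sketched in a few lines: the trajectory (Hankel) matrix of a univariate series and HJ-biplot markers, in which both rows and columns are expressed in principal coordinates (UΣ and VΣ). This is a minimal sketch under assumptions: no centring is applied, the structural-change detection step of the enhanced method is not shown, and the window length and toy series are arbitrary.

```python
import numpy as np

def trajectory_matrix(y, L):
    """L x K Hankel trajectory matrix with entries X[i, j] = y[i + j]."""
    y = np.asarray(y, dtype=float)
    K = len(y) - L + 1
    return np.array([y[i:i + K] for i in range(L)])

def hj_biplot_markers(X, k=2):
    """Row markers U*S and column markers V*S on the first k axes."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    rows = U[:, :k] * s[:k]
    cols = Vt[:k].T * s[:k]
    return rows, cols

# Hypothetical series: linear trend plus a period-12 oscillation.
t = np.arange(60)
y = 0.05 * t + np.sin(2 * np.pi * t / 12)
X = trajectory_matrix(y, L=24)
rows, cols = hj_biplot_markers(X)
print(X.shape, rows.shape, cols.shape)
```

Applying the same two functions separately to each homogeneous interval of a series would mirror the idea of analysing the intervals between structural changes one at a time.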

Advances in Data Analysis and Classification, vol. 18, no. 2, pp. 409–430.
Citations: 0
Composite likelihood methods for parsimonious model-based clustering of mixed-type data
IF 1.4 | Zone 4, Computer Science | Q2 Statistics & Probability | Pub Date: 2023-04-09 | DOI: 10.1007/s11634-023-00539-5
Monia Ranalli, Roberto Rocci

In this paper, we propose twelve parsimonious models for clustering mixed-type (ordinal and continuous) data. The dependence among the different types of variables is modeled by assuming that ordinal and continuous data follow a multivariate finite mixture of Gaussians, where the ordinal variables are a discretization of some continuous variates of the mixture. The general class of parsimonious models is based on a factor decomposition of the component-specific covariance matrices. Parameter estimation is carried out using an EM-type algorithm based on composite likelihood. The proposal is evaluated through a simulation study and an application to real data.

Advances in Data Analysis and Classification, vol. 18, no. 2, pp. 381–407. PDF: https://link.springer.com/content/pdf/10.1007/s11634-023-00539-5.pdf
Citations: 0
Identification of representative trees in random forests based on a new tree-based distance measure
IF 1.4 | Zone 4, Computer Science | Q2 Statistics & Probability | Pub Date: 2023-03-16 | DOI: 10.1007/s11634-023-00537-7
Björn-Hergen Laabs, Ana Westenberger, Inke R. König

In life sciences, random forests are often used to train predictive models. However, gaining any explanatory insight into the mechanics leading to a specific outcome is rather complex, which impedes the implementation of random forests into clinical practice. By simplifying a complex ensemble of decision trees to a single most representative tree, it is assumed to be possible to observe common tree structures, the importance of specific features and variable interactions. Thus, representative trees could also help to understand interactions between genetic variants. Intuitively, representative trees are those with the minimal distance to all other trees, which requires a proper definition of the distance between two trees. Thus, we developed a new tree-based distance measure, which incorporates more of the underlying tree structure than other metrics. We compared our new method with the existing metrics in an extensive simulation study and applied it to predict the age at onset based on a set of genetic risk factors in a clinical data set. In our simulation study we were able to show the advantages of our weighted splitting variable approach. Our real data application revealed that representative trees are not only able to replicate the results from a recent genome-wide association study, but also can give additional explanations of the genetic mechanisms. Finally, we implemented all compared distance measures in R and made them publicly available in the R package timbR (https://github.com/imbs-hl/timbR).
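The selection rule described above (the representative tree is the one with minimal distance to all other trees) can be sketched once a pairwise distance is available. The distance used here, the fraction of features that appear as a splitting variable in exactly one of two trees, is a simple unweighted stand-in and not the authors' new weighted measure; trees are represented only by their hypothetical sets of splitting-variable indices.

```python
import numpy as np

def splitting_variable_distance(vars_a, vars_b, n_features):
    """Fraction of features used as a splitting variable in exactly one tree."""
    a = np.zeros(n_features, dtype=bool)
    b = np.zeros(n_features, dtype=bool)
    for v in vars_a:
        a[v] = True
    for v in vars_b:
        b[v] = True
    return float(np.mean(a != b))

def most_representative(D):
    """Index of the tree whose summed distance to all other trees is smallest."""
    D = np.asarray(D, dtype=float)
    return int(np.argmin(D.sum(axis=1)))

# Hypothetical forest: each tree is described by the features it splits on.
trees = [{0, 1, 2}, {0, 1}, {0, 1, 3}, {4, 5}]
n_features = 6
D = np.array([[splitting_variable_distance(a, b, n_features) for b in trees]
              for a in trees])
best = most_representative(D)
print(best)
```

Any other tree distance, including the one implemented in the timbR package, could be substituted by filling `D` differently; the selection step itself does not change.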

Advances in Data Analysis and Classification, vol. 18, no. 2, pp. 363–380. PDF: https://link.springer.com/content/pdf/10.1007/s11634-023-00537-7.pdf
Citations: 0
Threshold-based Naïve Bayes classifier
IF 1.4 | Zone 4, Computer Science | Q2 Statistics & Probability | Pub Date: 2023-03-14 | DOI: 10.1007/s11634-023-00536-8
Maurizio Romano, Giulia Contu, Francesco Mola, Claudio Conversano

The Threshold-based Naïve Bayes (Tb-NB) classifier is introduced as a (simple) improved version of the original Naïve Bayes classifier. Tb-NB extracts the sentiment from a natural language text corpus and allows the user not only to predict how positive (negative) a sentence is but also to quantify a sentiment with a numeric value. It is based on the estimation of a single threshold value that helps define a decision rule classifying a text as a positive (negative) opinion based on its content. One of the main advantages of Tb-NB is the possibility of using its results as the input of post-hoc analyses aimed at observing how the quality associated with the different dimensions of a product or a service or, in a mirrored fashion, the different dimensions of customer satisfaction evolve in time or change with respect to different locations. The effectiveness of Tb-NB is evaluated by analyzing data concerning the tourism industry and, specifically, hotel guests’ reviews from all hotels located in the Sardinian region and available on Booking.com. Moreover, Tb-NB is compared with other popular classifiers used in sentiment analysis in terms of model accuracy, resistance to noise and computational efficiency.
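The general idea of a numeric sentiment score compared against a single estimated threshold can be sketched as follows. This is not the authors' exact formulation: the per-word log-odds with add-one smoothing, the summed score, and the accuracy-maximising choice of threshold are all assumptions made for illustration, as are the toy reviews.

```python
import math
from collections import Counter

def fit_log_odds(docs, labels):
    """Per-word log-odds of appearing in positive vs negative documents."""
    pos, neg = Counter(), Counter()
    for words, y in zip(docs, labels):
        (pos if y == 1 else neg).update(set(words))
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    vocab = set(pos) | set(neg)
    # Add-one smoothing keeps the logarithms finite for unseen combinations.
    return {w: math.log((pos[w] + 1) / (n_pos + 2))
               - math.log((neg[w] + 1) / (n_neg + 2)) for w in vocab}

def score(words, log_odds):
    """Numeric sentiment: sum of log-odds over the distinct known words."""
    return sum(log_odds.get(w, 0.0) for w in set(words))

def fit_threshold(scores, labels):
    """Cut point (midpoint between sorted scores) with the best training accuracy."""
    s = sorted(set(scores))
    cuts = [(a + b) / 2 for a, b in zip(s, s[1:])] or s
    def accuracy(t):
        return sum((sc > t) == (y == 1) for sc, y in zip(scores, labels)) / len(labels)
    return max(cuts, key=accuracy)

# Hypothetical tokenised hotel reviews: 1 = positive, 0 = negative.
docs = [["clean", "friendly"], ["dirty", "noisy"],
        ["friendly", "quiet"], ["noisy", "rude"]]
labels = [1, 0, 1, 0]
log_odds = fit_log_odds(docs, labels)
tau = fit_threshold([score(d, log_odds) for d in docs], labels)
new_score = score(["friendly", "rude"], log_odds)
print(new_score, tau, new_score > tau)
```

Because the score is a plain number, it can be averaged by hotel, period, or review topic, which is the kind of post-hoc analysis the abstract refers to.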

Advances in Data Analysis and Classification, vol. 18, no. 2, pp. 325–361. PDF: https://link.springer.com/content/pdf/10.1007/s11634-023-00536-8.pdf
Citations: 0
Editorial for ADAC issue 1 of volume 17 (2023)
IF 1.6 | Zone 4, Computer Science | Q2 Statistics & Probability | Pub Date: 2023-02-17 | DOI: 10.1007/s11634-023-00535-9
Maurizio Vichi, Andrea Cerioli, Hans A. Kestler, Akinori Okada, Claus Weihs
Advances in Data Analysis and Classification, vol. 17, no. 1, pp. 1–4. PDF: https://link.springer.com/content/pdf/10.1007/s11634-023-00535-9.pdf
Citations: 0
Clustering data with non-ignorable missingness using semi-parametric mixture models assuming independence within components
IF 1.6 | Zone 4, Computer Science | Q2 Statistics & Probability | Pub Date: 2023-02-12 | DOI: 10.1007/s11634-023-00534-w
Marie du Roy de Chaumaray, Matthieu Marbac

We propose a semi-parametric clustering model assuming conditional independence given the component. One advantage is that this model can handle non-ignorable missingness. The model defines each component as a product of univariate probability distributions but makes no assumption on the form of each univariate density. Note that the mixture model is used for clustering but not for estimating the density of the full variables (observed and unobserved). Estimation is performed by maximizing an extension of the smoothed likelihood allowing missingness. This optimization is achieved by a Majorization-Minorization algorithm. We illustrate the relevance of our approach by numerical experiments conducted on simulated data. Under mild assumptions, we show the identifiability of the model defining the distribution of the observed data and the monotonicity of the algorithm. We also propose an extension of this new method to the case of mixed-type data that we illustrate on a real data set. The proposed method is implemented in the R package MNARclust available on CRAN.

Advances in Data Analysis and Classification, vol. 17, no. 4, pp. 1081–1122.
Citations: 0
Robust instance-dependent cost-sensitive classification
IF 1.6 | Zone 4, Computer Science | Q2 Statistics & Probability | Pub Date: 2023-01-07 | DOI: 10.1007/s11634-022-00533-3
Simon De Vos, Toon Vanderschueren, Tim Verdonck, Wouter Verbeke

Instance-dependent cost-sensitive (IDCS) learning methods have proven useful for binary classification tasks where individual instances are associated with variable misclassification costs. However, we demonstrate in this paper by means of a series of experiments that IDCS methods are sensitive to noise and outliers in relation to instance-dependent misclassification costs and their performance strongly depends on the cost distribution of the data sample. Therefore, we propose a generic three-step framework to make IDCS methods more robust: (i) detect outliers automatically, (ii) correct outlying cost information in a data-driven way, and (iii) construct an IDCS learning method using the adjusted cost information. We apply this framework to cslogit, a logistic regression-based IDCS method, to obtain its robust version, which we name r-cslogit. The robustness of this approach is introduced in steps (i) and (ii), where we make use of robust estimators to detect and impute outlying costs of individual instances. The newly proposed r-cslogit method is tested on synthetic and semi-synthetic data and proven to be superior in terms of savings compared to its non-robust counterpart for variable levels of noise and outliers. All our code is made available online at https://github.com/SimonDeVos/Robust-IDCS.
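Steps (i) and (ii) of the framework can be sketched with a simple robust rule. The abstract only states that robust estimators are used to detect and impute outlying costs, so the specific choices here (median and scaled MAD as robust location and scale, a cut-off of 3, and imputation by the median) are assumptions rather than the authors' procedure; the authors' own code is in the linked repository.

```python
import numpy as np

def correct_outlying_costs(costs, z_max=3.0):
    """Flag costs with a large robust z-score and replace them by the median."""
    costs = np.asarray(costs, dtype=float)
    med = np.median(costs)
    # 1.4826 makes the MAD consistent with the standard deviation under normality.
    mad = 1.4826 * np.median(np.abs(costs - med))
    if mad == 0:
        return costs.copy(), np.zeros(costs.shape, dtype=bool)
    is_outlier = np.abs(costs - med) / mad > z_max   # step (i): detect
    corrected = np.where(is_outlier, med, costs)     # step (ii): correct
    return corrected, is_outlier

# Hypothetical misclassification costs with one corrupted entry.
costs = np.array([100.0, 120.0, 90.0, 110.0, 105.0, 5000.0])
corrected, is_outlier = correct_outlying_costs(costs)
print(corrected, is_outlier)
```

In step (iii) the corrected costs would replace the raw ones in whichever IDCS learner is trained, so that a single extreme cost no longer dominates the objective.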

实例相关成本敏感(IDCS)学习方法已被证明可用于二进制分类任务,其中单个实例与可变的错误分类成本相关。然而,我们在本文中通过一系列实验证明,IDCS方法对与实例相关的错误分类成本的噪声和异常值敏感,并且它们的性能在很大程度上取决于数据样本的成本分布。因此,我们提出了一个通用的三步框架,使IDCS方法更加稳健:(i)自动检测异常值,(ii)以数据驱动的方式校正异常成本信息,以及(iii)使用调整后的成本信息构建IDCS学习方法。我们将该框架应用于cslogit,一种基于逻辑回归的IDCS方法,以获得其稳健版本,我们将其命名为r-cslogit。在步骤(i)和(ii)中介绍了这种方法的稳健性,其中我们使用稳健估计量来检测和估算个别实例的异常成本。新提出的r-cslogit方法在合成和半合成数据上进行了测试,并被证明在可变噪声水平和异常值的情况下,与非鲁棒方法相比,在节省方面是优越的。我们的所有代码都可在线获取,网址为https://github.com/SimonDeVos/Robust-IDCS.
Simon De Vos, Toon Vanderschueren, Tim Verdonck, Wouter Verbeke. Robust instance-dependent cost-sensitive classification. Advances in Data Analysis and Classification 17(4), 1057-1079. Pub Date: 2023-01-07. DOI: 10.1007/s11634-022-00533-3
Citations: 0
Journal: Advances in Data Analysis and Classification