Pub Date : 2024-11-15  DOI: 10.1007/s11634-024-00615-4
Maurizio Vichi, Andrea Cerioli, Hans A. Kestler, Akinori Okada, Claus Weihs
{"title":"Editorial for ADAC issue 4 of volume 18 (2024)","authors":"Maurizio Vichi, Andrea Cerioli, Hans A. Kestler, Akinori Okada, Claus Weihs","doi":"10.1007/s11634-024-00615-4","DOIUrl":"10.1007/s11634-024-00615-4","url":null,"abstract":"","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"18 4","pages":"823 - 826"},"PeriodicalIF":1.4,"publicationDate":"2024-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142679608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-09-04  DOI: 10.1007/s11634-024-00605-6
Paula Brito, Andrea Cerioli, Luis Angel García-Escudero, Gilbert Saporta
{"title":"Special issue on “New methodologies in clustering and classification for complex and/or big data”","authors":"Paula Brito, Andrea Cerioli, Luis Angel García-Escudero, Gilbert Saporta","doi":"10.1007/s11634-024-00605-6","DOIUrl":"10.1007/s11634-024-00605-6","url":null,"abstract":"","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"18 3","pages":"539 - 543"},"PeriodicalIF":1.4,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142409860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-09-03  DOI: 10.1007/s11634-024-00604-7
Francesco Bartolucci, Antonietta Mira, Stefano Peluso
A new modeling framework is proposed for bipartite social networks arising from a sequence of partially time-ordered relational events. We directly model the joint distribution of the binary variables indicating whether each actor is involved in an event. The adopted parametrization is based on first- and second-order effects, formulated as in marginal models for categorical data, with higher-order effects left free. In particular, second-order effects are log-odds ratios with a meaningful social interpretation in terms of the tendency to cooperate, whereas first-order effects are interpreted in terms of the tendency of each actor to participate in an event. These effects are parametrized as functions of the event times, so that suitable latent trajectories of individual behavior can be represented. Inference is based on a composite likelihood function, maximized by an algorithm whose numerical complexity is proportional to the square of the number of units in the network. A classification composite likelihood is used to cluster the actors, simplifying the interpretation of the data structure. The proposed approach is illustrated on simulated data and on a dataset of scientific articles published in four top statistical journals from 2003 to 2012.
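To make the two kinds of effect concrete, here is a plausible rendering in standard marginal-model notation (the paper's exact time-varying parametrization may differ). With Y_i the binary indicator that actor i participates in a given event, the first-order effect is a marginal logit and the second-order effect for a pair of actors i and j is a marginal log-odds ratio:

```latex
\eta_i = \log\frac{P(Y_i=1)}{P(Y_i=0)}, \qquad
\eta_{ij} = \log\frac{P(Y_i=1,\,Y_j=1)\,P(Y_i=0,\,Y_j=0)}{P(Y_i=1,\,Y_j=0)\,P(Y_i=0,\,Y_j=1)}.
```

A positive η_ij means that actors i and j appear together in events more often than independence would predict, which is the "tendency to cooperate" reading given above.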
{"title":"Marginal models with individual-specific effects for the analysis of longitudinal bipartite networks","authors":"Francesco Bartolucci, Antonietta Mira, Stefano Peluso","doi":"10.1007/s11634-024-00604-7","DOIUrl":"https://doi.org/10.1007/s11634-024-00604-7","url":null,"abstract":"<p>A new modeling framework for bipartite social networks arising from a sequence of partially time-ordered relational events is proposed. We directly model the joint distribution of the binary variables indicating if each single actor is involved or not in an event. The adopted parametrization is based on first- and second-order effects, formulated as in marginal models for categorical data and free higher order effects. In particular, second-order effects are log-odds ratios with meaningful interpretation from the social perspective in terms of tendency to cooperate, in contrast to first-order effects interpreted in terms of tendency of each single actor to participate in an event. These effects are parametrized on the basis of the event times, so that suitable latent trajectories of individual behaviors may be represented. Inference is based on a composite likelihood function, maximized by an algorithm with numerical complexity proportional to the square of the number of units in the network. A classification composite likelihood is used to cluster the actors, simplifying the interpretation of the data structure. The proposed approach is illustrated on simulated data and on a dataset of scientific articles published in four top statistical journals from 2003 to 2012.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"61 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142209751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-08-21  DOI: 10.1007/s11634-024-00602-9
Inácio Nascimento, Raydonal Ospina, Getúlio Amorim
Cluster analysis techniques are a common approach to classifying the objects in a dataset into distinct clusters. The clustering of the geometric shapes of objects is important in many fields of study. To analyze geometric shapes, researchers often employ Statistical Shape Analysis methods, which retain the crucial information that remains after removing the effects of scaling, location, and rotation. Consequently, several researchers have focused on adapting clustering algorithms for shape analysis. Recently, three-dimensional (3D) shape clustering has become crucial for analyzing, interpreting, and effectively utilizing 3D data across diverse industries, including medicine, robotics, civil engineering, and paleontology. In this study, we adapt the K-means, CLARANS and Hill Climbing methods using an approach based on the Bagging procedure to achieve enhanced clustering accuracy. We conduct simulation experiments for both isotropy and anisotropy scenarios, considering various degrees of dispersion. Furthermore, we apply the proposed approach to real datasets from the relevant literature. We evaluate the resulting clusters using cluster validation measures, specifically the Rand Index and the Fowlkes–Mallows Index. Our results demonstrate that combining the Bagging procedure with the K-means, CLARANS and Hill Climbing methods yields substantial gains in clustering quality.
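As an illustration of how Bagging can be wrapped around a base clusterer, the sketch below aggregates bootstrap K-means runs through a co-association (consensus) matrix; this is one standard aggregation rule and an assumption on our part, not necessarily the authors' exact procedure, and the function name `bagged_kmeans` is ours. For shape data, the inputs would first be mapped to, e.g., Procrustes tangent coordinates.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def bagged_kmeans(X, n_clusters=3, n_boot=50, seed=0):
    """Bagged K-means: count how often pairs of objects share a cluster
    across bootstrap runs, then cut a consensus dendrogram."""
    rng = np.random.default_rng(seed)
    n = len(X)
    co = np.zeros((n, n))
    for b in range(n_boot):
        idx = rng.choice(n, size=n, replace=True)      # bootstrap resample
        km = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed + b).fit(X[idx])
        labels = km.predict(X)                         # label every object
        co += labels[:, None] == labels[None, :]       # co-membership counts
    dist = 1.0 - co / n_boot                           # consensus dissimilarity
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")

if __name__ == "__main__":
    from sklearn.datasets import make_blobs
    X, _ = make_blobs(n_samples=120, centers=3, random_state=1)
    print(bagged_kmeans(X, n_clusters=3)[:10])
```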
{"title":"Using Bagging to improve clustering methods in the context of three-dimensional shapes","authors":"Inácio Nascimento, Raydonal Ospina, Getúlio Amorim","doi":"10.1007/s11634-024-00602-9","DOIUrl":"https://doi.org/10.1007/s11634-024-00602-9","url":null,"abstract":"<p>Cluster Analysis techniques are a common approach to classifying objects within a dataset into distinct clusters. The clustering of geometric shapes of objects holds significant importance in various fields of study. To analyze the geometric shapes of objects, researchers often employ Statistical Shape Analysis methods, which retain crucial information after accounting for scaling, locating, and rotating an object. Consequently, several researchers have focused on adapting clustering algorithms for shape analysis. Recently, three-dimensional (3D) shape clustering has become crucial for analyzing, interpreting, and effectively utilizing 3D data across diverse industries, including medicine, robotics, civil engineering, and paleontology. In this study, we adapt the <i>K-means</i>, <i>CLARANS</i> and <i>Hill Climbing</i> methods using an approach based on the <i>Bagging</i> procedure to achieve enhanced clustering accuracy. We conduct simulation experiments for both isotropy and anisotropy scenarios, considering various dispersion variations. Furthermore, we apply the proposed approach to real datasets from relevant literature. We evaluate the obtained clusters using cluster validation measures, specifically the Rand Index and the Fowlkes-Mallows Index. Our results demonstrate substantial improvements in clustering quality when implementing the <i>Bagging</i> approach in conjunction with the <i>K-means</i>, <i>CLARANS</i> and <i>Hill Climbing</i> methods. The combination of the Bagging method and clustering algorithms provided substantial gains in the quality of the clusters.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"58 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142209752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-08-01  DOI: 10.1007/s11634-024-00600-x
Michael Greenacre
The analysis of compositional data has been dominated by the use of logratio transformations, which ensure exact subcompositional coherence and, in some situations, exact isometry as well. A problem with this approach is that data zeros, found in most applications, have to be replaced before the logarithmic transformation can be applied. A new alternative, called the ‘chiPower’ transformation, accommodates data zeros by combining the standardization inherent in the chi-square distance of correspondence analysis with the essential elements of the Box–Cox power transformation. The chiPower transformation is justified because it defines between-sample distances that, for strictly positive data, tend to logratio distances as the power parameter tends to zero, at which point it is equivalent to transforming to logratios. For data with zeros, a value of the power can be identified that brings the chiPower transformation as close as possible to a logratio transformation, without having to substitute the zeros. Especially for high-dimensional data, this alternative approach can achieve such a high level of coherence and isometry as to be a valid approach to the analysis of compositional data. Furthermore, in a supervised learning context, if the compositional variables serve as predictors of a response in a modelling framework, for example generalized linear models, then the power can be used as a tuning parameter, optimized for prediction accuracy through cross-validation. The chiPower-transformed variables have a straightforward interpretation, since they are identified with single compositional parts, not ratios.
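The limiting claim rests on the classical Box–Cox identity, worth stating explicitly (our addition, with λ denoting the power parameter):

```latex
\lim_{\lambda \to 0} \frac{x^{\lambda} - 1}{\lambda} = \log x, \qquad x > 0,
```

so as the power shrinks, power-transformed parts behave, up to centring and scaling, like log-transformed parts, and contrasts between them like logratios; for a zero part the power transform stays finite whenever λ > 0, which is why zeros need no replacement.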
{"title":"The chiPower transformation: a valid alternative to logratio transformations in compositional data analysis","authors":"Michael Greenacre","doi":"10.1007/s11634-024-00600-x","DOIUrl":"10.1007/s11634-024-00600-x","url":null,"abstract":"<div><p>The approach to analysing compositional data has been dominated by the use of logratio transformations, to ensure exact subcompositional coherence and, in some situations, exact isometry as well. A problem with this approach is that data zeros, found in most applications, have to be replaced to allow the logarithmic transformation. An alternative new approach, called the ‘chiPower’ transformation, which allows data zeros, is to combine the standardization inherent in the chi-square distance in correspondence analysis, with the essential elements of the Box-Cox power transformation. The chiPower transformation is justified because it defines between-sample distances that tend to logratio distances for strictly positive data as the power parameter tends to zero, and are then equivalent to transforming to logratios. For data with zeros, a value of the power can be identified that brings the chiPower transformation as close as possible to a logratio transformation, without having to substitute the zeros. Especially in the area of high-dimensional data, this alternative approach can present such a high level of coherence and isometry as to be a valid approach to the analysis of compositional data. Furthermore, in a supervised learning context, if the compositional variables serve as predictors of a response in a modelling framework, for example generalized linear models, then the power can be used as a tuning parameter in optimizing the accuracy of prediction through cross-validation. The chiPower-transformed variables have a straightforward interpretation, since they are identified with single compositional parts, not ratios.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"18 3","pages":"769 - 796"},"PeriodicalIF":1.4,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141886870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-07-26  DOI: 10.1007/s11634-024-00601-w
José García-García, María Ángeles Gil, María Asunción Lubiano
In recent years, interval-valued rating scales have been considered as an alternative to traditional single-point psychometric tools for human evaluations, such as Likert-type or visual analogue scales. More concretely, when answering intrinsically imprecise items in a questionnaire, interval-valued scales seem to capture richer information than conventional ones. When analyzing data from administrations of a questionnaire, one of the main goals is to ensure the internal consistency of the items in a construct or latent variable. The most popular indicator of internal consistency, whenever answers to items are given on a numerically based/encoded scale, is the well-known Cronbach α coefficient. This paper extends this coefficient to the case of interval-valued answers and analyzes some of its main statistical properties. After presenting some formal preliminaries for interval-valued data, Cronbach’s α coefficient is first extended to the case in which the constructs of a questionnaire allow interval-valued answers to their items. The range of the potential values of the extended coefficient is then discussed. Furthermore, the asymptotic distribution of the sample Cronbach α coefficient, along with its bias and consistency properties, is examined from a theoretical perspective. Finally, this asymptotic distribution, as well as the influence of the number of respondents and the number of items in the constructs, is illustrated empirically through simulation-based studies.
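For reference, the classical coefficient being extended is, for a construct with k real-valued items Y_1, …, Y_k and total score X = Y_1 + ⋯ + Y_k,

```latex
\alpha \;=\; \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma^{2}_{Y_i}}{\sigma^{2}_{X}}\right),
```

and the interval-valued extension replaces these variances with variances appropriate for random intervals (our gloss of the construction; see the paper for the precise definitions).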
{"title":"On some properties of Cronbach’s α coefficient for interval-valued data in questionnaires","authors":"José García-García, María Ángeles Gil, María Asunción Lubiano","doi":"10.1007/s11634-024-00601-w","DOIUrl":"https://doi.org/10.1007/s11634-024-00601-w","url":null,"abstract":"<p>Along recent years, interval-valued rating scales have been considered as an alternative to traditional single-point psychometric tools for human evaluations, such as Likert-type or visual analogue scales. More concretely, in answering to intrinsically imprecise items in a questionnaire, interval-valued scales seem to allow capturing a richer information than conventional ones. When analyzing data from given performances of questionnaires, one of the main targets is that of ensuring the internal consistency of the items in a construct or latent variable. The most popular indicator of internal consistency, whenever answers to items are given in accordance with a numerically based/encoded scale, is the well-known Cronbach <i> α</i> coefficient. This paper aims to extend such a coefficient to the case of interval-valued answers and to analyze some of its main statistical properties. For this purpose, after presenting some formal preliminaries for interval-valued data, firstly Cronbach’s <i> α</i> coefficient is extended to the case in which the constructs of a questionnaire allow interval-valued answers to their items. The range of the potential values of the extended coefficient is then discussed. Furthermore, the asymptotic distribution of the sample Cronbach <i> α</i> coefficient along with its bias and consistency properties, are examined from a theoretical perspective. Finally, the preceding asymptotic distribution of the sample coefficient as well as the influence of the number of respondents to the questionnaire and the number of items in the constructs are empirically illustrated through simulation-based studies.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"59 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141770279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-07-04  DOI: 10.1007/s11634-024-00599-1
Zeyu Ding, Simon Omlor, Katja Ickstadt, Alexander Munteanu
The logit and probit link functions are arguably the two most common choices for binary regression models. Many studies have extended the choice of link functions to avoid possible misspecification and to improve the model’s fit to the data. We introduce the p-generalized Gaussian distribution (p-GGD) to binary regression in a Bayesian framework. The p-GGD has received considerable attention due to its flexibility in modeling the tails, while generalizing, for instance, over the standard normal distribution (p = 2) or the Laplace distribution (p = 1). Here, we extend from maximum likelihood estimation (MLE) to Bayesian posterior estimation using Markov Chain Monte Carlo (MCMC) sampling for the model parameters β and the link function parameter p. We use simulated and real-world data to verify the effect of different values of p on the estimation results, and show how logistic regression and probit regression can be incorporated into a broader framework. To make our Bayesian methods scalable to large data, we also use coresets to reduce the data before running the complex and time-consuming MCMC analysis. This allows us to perform very efficient calculations while retaining the original posterior parameter distributions up to small distortions, both in practice and with theoretical guarantees.
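In one common parametrization (our rendering; the paper’s scaling may differ), the standard p-GGD density is

```latex
f_p(x) \;=\; \frac{p^{\,1-1/p}}{2\,\Gamma(1/p)}\,\exp\!\left(-\frac{|x|^{p}}{p}\right),
```

which reduces to the standard normal density at p = 2 and to the Laplace density at p = 1; its CDF then plays the role of the link function, so probit regression corresponds to p = 2, and suitable values of p yield a close approximation to logistic regression.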
{"title":"Scalable Bayesian p-generalized probit and logistic regression","authors":"Zeyu Ding, Simon Omlor, Katja Ickstadt, Alexander Munteanu","doi":"10.1007/s11634-024-00599-1","DOIUrl":"https://doi.org/10.1007/s11634-024-00599-1","url":null,"abstract":"<p>The logit and probit link functions are arguably the two most common choices for binary regression models. Many studies have extended the choice of link functions to avoid possible misspecification and to improve the model fit to the data. We introduce the <i>p</i>-generalized Gaussian distribution (<i>p</i>-GGD) to binary regression in a Bayesian framework. The <i>p</i>-GGD has received considerable attention due to its flexibility in modeling the tails, while generalizing, for instance, over the standard normal distribution where <span>(p=2)</span> or the Laplace distribution where <span>(p=1)</span>. Here, we extend from maximum likelihood estimation (MLE) to Bayesian posterior estimation using Markov Chain Monte Carlo (MCMC) sampling for the model parameters <span>(beta)</span> and the link function parameter <i>p</i>. We use simulated and real-world data to verify the effect of different parameters <i>p</i> on the estimation results, and how logistic regression and probit regression can be incorporated into a broader framework. To make our Bayesian methods scalable in the case of large data, we also incorporate coresets to reduce the data before running the complex and time-consuming MCMC analysis. This allows us to perform very efficient calculations while retaining the original posterior parameter distributions up to little distortions both, in practice, and with theoretical guarantees.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"3 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141546666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-06-25  DOI: 10.1007/s11634-024-00598-2
Ornela Bregu, Nizar Bouguila
In this paper, we consider an alternative parametrization of the Dirichlet Compound Negative Multinomial (DCNM) distribution using rising polynomials. The new parametrization eliminates the Gamma functions and allows us to derive the exact Fisher Information Matrix, which brings significant improvements to model performance because feature correlations are taken into account. Second, we improve computational efficiency by approximating the DCNM model as a member of the exponential family of distributions, called EDCNM. The novel EDCNM model brings several advantages over the DCNM model, such as a closed-form solution for maximum likelihood estimation and higher efficiency on sparse datasets due to reduced computation time. Third, we implement Agglomerative Hierarchical clustering, in which the Kullback–Leibler divergence is derived and used to measure the distance between two EDCNM probability distributions. Finally, we integrate the Minimum Message Length criterion into our algorithm to estimate the optimal number of components of the mixture model. The merits of the proposed models are validated via challenging real-world applications in Natural Language Processing and Image/Video Recognition. Results reveal that the exponential approximation of the DCNM model significantly reduces the computational complexity in high-dimensional feature spaces.
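To make the clustering step concrete, here is a minimal sketch of average-linkage agglomerative clustering driven by a symmetrized Kullback–Leibler divergence; a generic KL between discrete distributions stands in for the paper’s closed-form EDCNM divergence, and all names here are ours.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def sym_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence between two discrete distributions."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def kl_agglomerative(dists, n_clusters):
    """Average-linkage agglomerative clustering of distributions under sym. KL."""
    n = len(dists)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = sym_kl(dists[i], dists[j])
    Z = linkage(squareform(D, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")

# Toy usage: three word-frequency profiles, two obvious groups.
profiles = [[8, 1, 1], [7, 2, 1], [1, 1, 8]]
print(kl_agglomerative(profiles, n_clusters=2))  # two groups, e.g. [1 1 2]
```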
{"title":"Dirichlet compound negative multinomial mixture models and applications","authors":"Ornela Bregu, Nizar Bouguila","doi":"10.1007/s11634-024-00598-2","DOIUrl":"https://doi.org/10.1007/s11634-024-00598-2","url":null,"abstract":"<p>In this paper, we consider an alternative parametrization of Dirichlet Compound Negative Multinomial (DCNM) using rising polynomials. The new parametrization gets rid of Gamma functions and allows us to derive the Exact Fisher Information Matrix, which brings significant improvements to model performance due to feature correlation consideration. Second, we propose to improve the computation efficiency by approximating the DCNM model as a member of the exponential family of distributions, called EDCNM. The novel EDCNM model brings several advantages as compared to the DCNM model, such as a closed-form solution for maximum likelihood estimation, higher efficiency due to computational time reduction for sparse datasets, etc. Third, we implement Agglomerative Hierarchical clustering, where Kullback–Leibler divergence is derived and used to measure the distance between two EDCNM probability distributions. Finally, we integrate the Minimum Message Length criterion in our algorithm to estimate the optimal number of components of the mixture model. The merits of our proposed models are validated via challenging real-world applications in Natural Language Processing and Image/Video Recognition. Results reveal that the exponential approximation of the DCNM model has reduced significantly the computational complexity in high-dimensional feature spaces.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"25 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141506099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-06-19  DOI: 10.1007/s11634-024-00596-4
Carlos Moreno-Pérez, Marco Minozzo
This paper investigates the reactions of US financial markets to press news from January 2019 to 1 May 2020. To this end, we deduce the content and uncertainty of the news by developing apposite indices from the headlines and snippets of The New York Times, using unsupervised machine learning techniques. In particular, we use Latent Dirichlet Allocation to infer the content (topics) of the articles, and Word Embedding (implemented with the Skip-gram model) and K-Means to measure their uncertainty. In this way, we arrive at a set of daily topic-specific uncertainty indices. These indices are then used to explain the behavior of the US financial markets by fitting a batch of EGARCH models. In essence, we find that two topic-specific uncertainty indices, one related to COVID-19 news and the other to trade-war news, explain the bulk of the movements in the financial markets from the beginning of 2019 to end-April 2020. Moreover, we find that the topic-specific uncertainty index related to the economy and the Federal Reserve is positively related to the financial markets, meaning that our index is able to capture the actions of the Federal Reserve during periods of uncertainty.
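A minimal sketch of the topic-extraction stage, using sklearn; the toy corpus, model sizes, and the final clustering step are our illustrative assumptions, not the paper’s exact configuration (which works on New York Times headlines and snippets and measures uncertainty via Skip-gram embeddings).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

headlines = [
    "Stocks fall as coronavirus fears spread",
    "Fed signals patience on interest rates",
    "Tariffs escalate in US-China trade war",
    "Markets rally on stimulus hopes",
]

# Document-term counts, then LDA to infer topic mixtures per headline.
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(headlines)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)          # rows: P(topic | document)

# The paper scores uncertainty via Word Embedding + K-Means; here we only
# hint at the K-Means step by clustering the topic mixtures themselves.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(doc_topics)
print(doc_topics.round(2), labels)
```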
{"title":"Natural language processing and financial markets: semi-supervised modelling of coronavirus and economic news","authors":"Carlos Moreno-Pérez, Marco Minozzo","doi":"10.1007/s11634-024-00596-4","DOIUrl":"https://doi.org/10.1007/s11634-024-00596-4","url":null,"abstract":"<p>This paper investigates the reactions of US financial markets to press news from January 2019 to 1 May 2020. To this end, we deduce the content and uncertainty of the news by developing apposite indices from the headlines and snippets of The New York Times, using unsupervised machine learning techniques. In particular, we use Latent Dirichlet Allocation to infer the content (topics) of the articles, and Word Embedding (implemented with the Skip-gram model) and K-Means to measure their uncertainty. In this way, we arrive at the definition of a set of daily topic-specific uncertainty indices. These indices are then used to find explanations for the behavior of the US financial markets by implementing a batch of EGARCH models. In substance, we find that two topic-specific uncertainty indices, one related to COVID-19 news and the other to trade war news, explain the bulk of the movements in the financial markets from the beginning of 2019 to end-April 2020. Moreover, we find that the topic-specific uncertainty index related to the economy and the Federal Reserve is positively related to the financial markets, meaning that our index is able to capture the actions of the Federal Reserve during periods of uncertainty.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"82 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141506100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-06-10  DOI: 10.1007/s11634-024-00597-3
Maurizio Vichi, Andrea Cerioli, Hans A. Kestler, Akinori Okada, Claus Weihs
{"title":"Editorial for ADAC issue 2 of volume 18 (2024)","authors":"Maurizio Vichi, Andrea Cerioli, Hans A. Kestler, Akinori Okada, Claus Weihs","doi":"10.1007/s11634-024-00597-3","DOIUrl":"10.1007/s11634-024-00597-3","url":null,"abstract":"","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"18 2","pages":"245 - 249"},"PeriodicalIF":1.4,"publicationDate":"2024-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141366538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}