Pub Date : 2024-11-15  DOI: 10.1007/s11634-024-00615-4
Maurizio Vichi, Andrea Cerioli, Hans A. Kestler, Akinori Okada, Claus Weihs
{"title":"Editorial for ADAC issue 4 of volume 18 (2024)","authors":"Maurizio Vichi, Andrea Cerioli, Hans A. Kestler, Akinori Okada, Claus Weihs","doi":"10.1007/s11634-024-00615-4","DOIUrl":"10.1007/s11634-024-00615-4","url":null,"abstract":"","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"18 4","pages":"823 - 826"},"PeriodicalIF":1.4,"publicationDate":"2024-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142679608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-09-04  DOI: 10.1007/s11634-024-00605-6
Paula Brito, Andrea Cerioli, Luis Angel García-Escudero, Gilbert Saporta
{"title":"Special issue on “New methodologies in clustering and classification for complex and/or big data”","authors":"Paula Brito, Andrea Cerioli, Luis Angel García-Escudero, Gilbert Saporta","doi":"10.1007/s11634-024-00605-6","DOIUrl":"10.1007/s11634-024-00605-6","url":null,"abstract":"","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"18 3","pages":"539 - 543"},"PeriodicalIF":1.4,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142409860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-09-03  DOI: 10.1007/s11634-024-00604-7
Francesco Bartolucci, Antonietta Mira, Stefano Peluso
A new modeling framework is proposed for bipartite social networks arising from a sequence of partially time-ordered relational events. We directly model the joint distribution of the binary variables indicating whether each actor is involved in an event. The adopted parametrization is based on first- and second-order effects, formulated as in marginal models for categorical data, with higher-order effects left free. In particular, second-order effects are log-odds ratios with a meaningful social interpretation in terms of the tendency to cooperate, whereas first-order effects are interpreted in terms of the tendency of each actor to participate in an event. These effects are parametrized as functions of the event times, so that suitable latent trajectories of individual behavior can be represented. Inference is based on a composite likelihood function, maximized by an algorithm whose numerical complexity is proportional to the square of the number of units in the network. A classification composite likelihood is used to cluster the actors, simplifying the interpretation of the data structure. The proposed approach is illustrated on simulated data and on a dataset of scientific articles published in four top statistical journals from 2003 to 2012.
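To make the two kinds of effect concrete, here is a plausible rendering in standard marginal-model notation (the paper's exact time-varying parametrization may differ). With Y_i the binary indicator that actor i participates in a given event, the first-order effect is a marginal logit and the second-order effect for a pair of actors i and j is a marginal log-odds ratio:

```latex
\eta_i = \log\frac{P(Y_i=1)}{P(Y_i=0)}, \qquad
\eta_{ij} = \log\frac{P(Y_i=1,\,Y_j=1)\,P(Y_i=0,\,Y_j=0)}{P(Y_i=1,\,Y_j=0)\,P(Y_i=0,\,Y_j=1)}.
```

A positive η_ij means that actors i and j appear together in events more often than independence would predict, which is the "tendency to cooperate" reading given above.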
{"title":"Marginal models with individual-specific effects for the analysis of longitudinal bipartite networks","authors":"Francesco Bartolucci, Antonietta Mira, Stefano Peluso","doi":"10.1007/s11634-024-00604-7","DOIUrl":"https://doi.org/10.1007/s11634-024-00604-7","url":null,"abstract":"<p>A new modeling framework for bipartite social networks arising from a sequence of partially time-ordered relational events is proposed. We directly model the joint distribution of the binary variables indicating if each single actor is involved or not in an event. The adopted parametrization is based on first- and second-order effects, formulated as in marginal models for categorical data and free higher order effects. In particular, second-order effects are log-odds ratios with meaningful interpretation from the social perspective in terms of tendency to cooperate, in contrast to first-order effects interpreted in terms of tendency of each single actor to participate in an event. These effects are parametrized on the basis of the event times, so that suitable latent trajectories of individual behaviors may be represented. Inference is based on a composite likelihood function, maximized by an algorithm with numerical complexity proportional to the square of the number of units in the network. A classification composite likelihood is used to cluster the actors, simplifying the interpretation of the data structure. The proposed approach is illustrated on simulated data and on a dataset of scientific articles published in four top statistical journals from 2003 to 2012.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"61 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142209751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-08-21  DOI: 10.1007/s11634-024-00602-9
Inácio Nascimento, Raydonal Ospina, Getúlio Amorim
Cluster analysis techniques are a common approach to classifying the objects in a dataset into distinct clusters. The clustering of the geometric shapes of objects is important in many fields of study. To analyze geometric shapes, researchers often employ Statistical Shape Analysis methods, which retain the crucial information that remains after removing the effects of scaling, location, and rotation. Consequently, several researchers have focused on adapting clustering algorithms for shape analysis. Recently, three-dimensional (3D) shape clustering has become crucial for analyzing, interpreting, and effectively utilizing 3D data across diverse industries, including medicine, robotics, civil engineering, and paleontology. In this study, we adapt the K-means, CLARANS and Hill Climbing methods using an approach based on the Bagging procedure to achieve enhanced clustering accuracy. We conduct simulation experiments for both isotropy and anisotropy scenarios, considering various degrees of dispersion. Furthermore, we apply the proposed approach to real datasets from the relevant literature. We evaluate the resulting clusters using cluster validation measures, specifically the Rand Index and the Fowlkes–Mallows Index. Our results demonstrate that combining the Bagging procedure with the K-means, CLARANS and Hill Climbing methods yields substantial gains in clustering quality.
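As an illustration of how Bagging can be wrapped around a base clusterer, the sketch below aggregates bootstrap K-means runs through a co-association (consensus) matrix; this is one standard aggregation rule and an assumption on our part, not necessarily the authors' exact procedure, and the function name `bagged_kmeans` is ours. For shape data, the inputs would first be mapped to, e.g., Procrustes tangent coordinates.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def bagged_kmeans(X, n_clusters=3, n_boot=50, seed=0):
    """Bagged K-means: count how often pairs of objects share a cluster
    across bootstrap runs, then cut a consensus dendrogram."""
    rng = np.random.default_rng(seed)
    n = len(X)
    co = np.zeros((n, n))
    for b in range(n_boot):
        idx = rng.choice(n, size=n, replace=True)      # bootstrap resample
        km = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed + b).fit(X[idx])
        labels = km.predict(X)                         # label every object
        co += labels[:, None] == labels[None, :]       # co-membership counts
    dist = 1.0 - co / n_boot                           # consensus dissimilarity
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")

if __name__ == "__main__":
    from sklearn.datasets import make_blobs
    X, _ = make_blobs(n_samples=120, centers=3, random_state=1)
    print(bagged_kmeans(X, n_clusters=3)[:10])
```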
{"title":"Using Bagging to improve clustering methods in the context of three-dimensional shapes","authors":"Inácio Nascimento, Raydonal Ospina, Getúlio Amorim","doi":"10.1007/s11634-024-00602-9","DOIUrl":"https://doi.org/10.1007/s11634-024-00602-9","url":null,"abstract":"<p>Cluster Analysis techniques are a common approach to classifying objects within a dataset into distinct clusters. The clustering of geometric shapes of objects holds significant importance in various fields of study. To analyze the geometric shapes of objects, researchers often employ Statistical Shape Analysis methods, which retain crucial information after accounting for scaling, locating, and rotating an object. Consequently, several researchers have focused on adapting clustering algorithms for shape analysis. Recently, three-dimensional (3D) shape clustering has become crucial for analyzing, interpreting, and effectively utilizing 3D data across diverse industries, including medicine, robotics, civil engineering, and paleontology. In this study, we adapt the <i>K-means</i>, <i>CLARANS</i> and <i>Hill Climbing</i> methods using an approach based on the <i>Bagging</i> procedure to achieve enhanced clustering accuracy. We conduct simulation experiments for both isotropy and anisotropy scenarios, considering various dispersion variations. Furthermore, we apply the proposed approach to real datasets from relevant literature. We evaluate the obtained clusters using cluster validation measures, specifically the Rand Index and the Fowlkes-Mallows Index. Our results demonstrate substantial improvements in clustering quality when implementing the <i>Bagging</i> approach in conjunction with the <i>K-means</i>, <i>CLARANS</i> and <i>Hill Climbing</i> methods. The combination of the Bagging method and clustering algorithms provided substantial gains in the quality of the clusters.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"58 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142209752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-08-01  DOI: 10.1007/s11634-024-00600-x
Michael Greenacre
The analysis of compositional data has been dominated by the use of logratio transformations, which ensure exact subcompositional coherence and, in some situations, exact isometry as well. A problem with this approach is that data zeros, found in most applications, have to be replaced before the logarithmic transformation can be applied. A new alternative, called the ‘chiPower’ transformation, accommodates data zeros by combining the standardization inherent in the chi-square distance of correspondence analysis with the essential elements of the Box–Cox power transformation. The chiPower transformation is justified because it defines between-sample distances that, for strictly positive data, tend to logratio distances as the power parameter tends to zero, at which point it is equivalent to transforming to logratios. For data with zeros, a value of the power can be identified that brings the chiPower transformation as close as possible to a logratio transformation, without having to substitute the zeros. Especially for high-dimensional data, this alternative approach can achieve such a high level of coherence and isometry as to be a valid approach to the analysis of compositional data. Furthermore, in a supervised learning context, if the compositional variables serve as predictors of a response in a modelling framework, for example generalized linear models, then the power can be used as a tuning parameter, optimized for prediction accuracy through cross-validation. The chiPower-transformed variables have a straightforward interpretation, since they are identified with single compositional parts, not ratios.
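The limiting claim rests on the classical Box–Cox identity, worth stating explicitly (our addition, with λ denoting the power parameter):

```latex
\lim_{\lambda \to 0} \frac{x^{\lambda} - 1}{\lambda} = \log x, \qquad x > 0,
```

so as the power shrinks, power-transformed parts behave, up to centring and scaling, like log-transformed parts, and contrasts between them like logratios; for a zero part the power transform stays finite whenever λ > 0, which is why zeros need no replacement.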
{"title":"The chiPower transformation: a valid alternative to logratio transformations in compositional data analysis","authors":"Michael Greenacre","doi":"10.1007/s11634-024-00600-x","DOIUrl":"10.1007/s11634-024-00600-x","url":null,"abstract":"<div><p>The approach to analysing compositional data has been dominated by the use of logratio transformations, to ensure exact subcompositional coherence and, in some situations, exact isometry as well. A problem with this approach is that data zeros, found in most applications, have to be replaced to allow the logarithmic transformation. An alternative new approach, called the ‘chiPower’ transformation, which allows data zeros, is to combine the standardization inherent in the chi-square distance in correspondence analysis, with the essential elements of the Box-Cox power transformation. The chiPower transformation is justified because it defines between-sample distances that tend to logratio distances for strictly positive data as the power parameter tends to zero, and are then equivalent to transforming to logratios. For data with zeros, a value of the power can be identified that brings the chiPower transformation as close as possible to a logratio transformation, without having to substitute the zeros. Especially in the area of high-dimensional data, this alternative approach can present such a high level of coherence and isometry as to be a valid approach to the analysis of compositional data. Furthermore, in a supervised learning context, if the compositional variables serve as predictors of a response in a modelling framework, for example generalized linear models, then the power can be used as a tuning parameter in optimizing the accuracy of prediction through cross-validation. The chiPower-transformed variables have a straightforward interpretation, since they are identified with single compositional parts, not ratios.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"18 3","pages":"769 - 796"},"PeriodicalIF":1.4,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141886870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-07-26  DOI: 10.1007/s11634-024-00601-w
José García-García, María Ángeles Gil, María Asunción Lubiano
In recent years, interval-valued rating scales have been considered as an alternative to traditional single-point psychometric tools for human evaluations, such as Likert-type or visual analogue scales. More concretely, when answering intrinsically imprecise items in a questionnaire, interval-valued scales seem to capture richer information than conventional ones. When analyzing data from administrations of a questionnaire, one of the main goals is to ensure the internal consistency of the items in a construct or latent variable. The most popular indicator of internal consistency, whenever answers to items are given on a numerically based/encoded scale, is the well-known Cronbach α coefficient. This paper extends this coefficient to the case of interval-valued answers and analyzes some of its main statistical properties. After presenting some formal preliminaries for interval-valued data, Cronbach’s α coefficient is first extended to the case in which the constructs of a questionnaire allow interval-valued answers to their items. The range of the potential values of the extended coefficient is then discussed. Furthermore, the asymptotic distribution of the sample Cronbach α coefficient, along with its bias and consistency properties, is examined from a theoretical perspective. Finally, this asymptotic distribution, as well as the influence of the number of respondents and the number of items in the constructs, is illustrated empirically through simulation-based studies.
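For reference, the classical coefficient being extended is, for a construct with k real-valued items Y_1, …, Y_k and total score X = Y_1 + ⋯ + Y_k,

```latex
\alpha \;=\; \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma^{2}_{Y_i}}{\sigma^{2}_{X}}\right),
```

and the interval-valued extension replaces these variances with variances appropriate for random intervals (our gloss of the construction; see the paper for the precise definitions).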
{"title":"On some properties of Cronbach’s α coefficient for interval-valued data in questionnaires","authors":"José García-García, María Ángeles Gil, María Asunción Lubiano","doi":"10.1007/s11634-024-00601-w","DOIUrl":"https://doi.org/10.1007/s11634-024-00601-w","url":null,"abstract":"<p>Along recent years, interval-valued rating scales have been considered as an alternative to traditional single-point psychometric tools for human evaluations, such as Likert-type or visual analogue scales. More concretely, in answering to intrinsically imprecise items in a questionnaire, interval-valued scales seem to allow capturing a richer information than conventional ones. When analyzing data from given performances of questionnaires, one of the main targets is that of ensuring the internal consistency of the items in a construct or latent variable. The most popular indicator of internal consistency, whenever answers to items are given in accordance with a numerically based/encoded scale, is the well-known Cronbach <i> α</i> coefficient. This paper aims to extend such a coefficient to the case of interval-valued answers and to analyze some of its main statistical properties. For this purpose, after presenting some formal preliminaries for interval-valued data, firstly Cronbach’s <i> α</i> coefficient is extended to the case in which the constructs of a questionnaire allow interval-valued answers to their items. The range of the potential values of the extended coefficient is then discussed. Furthermore, the asymptotic distribution of the sample Cronbach <i> α</i> coefficient along with its bias and consistency properties, are examined from a theoretical perspective. Finally, the preceding asymptotic distribution of the sample coefficient as well as the influence of the number of respondents to the questionnaire and the number of items in the constructs are empirically illustrated through simulation-based studies.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"59 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141770279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-07-04  DOI: 10.1007/s11634-024-00599-1
Zeyu Ding, Simon Omlor, Katja Ickstadt, Alexander Munteanu
The logit and probit link functions are arguably the two most common choices for binary regression models. Many studies have extended the choice of link functions to avoid possible misspecification and to improve the model’s fit to the data. We introduce the p-generalized Gaussian distribution (p-GGD) to binary regression in a Bayesian framework. The p-GGD has received considerable attention due to its flexibility in modeling the tails, while generalizing, for instance, over the standard normal distribution (p = 2) or the Laplace distribution (p = 1). Here, we extend from maximum likelihood estimation (MLE) to Bayesian posterior estimation using Markov Chain Monte Carlo (MCMC) sampling for the model parameters β and the link function parameter p. We use simulated and real-world data to verify the effect of different values of p on the estimation results, and show how logistic regression and probit regression can be incorporated into a broader framework. To make our Bayesian methods scalable to large data, we also use coresets to reduce the data before running the complex and time-consuming MCMC analysis. This allows us to perform very efficient calculations while retaining the original posterior parameter distributions up to small distortions, both in practice and with theoretical guarantees.
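In one common parametrization (our rendering; the paper’s scaling may differ), the standard p-GGD density is

```latex
f_p(x) \;=\; \frac{p^{\,1-1/p}}{2\,\Gamma(1/p)}\,\exp\!\left(-\frac{|x|^{p}}{p}\right),
```

which reduces to the standard normal density at p = 2 and to the Laplace density at p = 1; its CDF then plays the role of the link function, so probit regression corresponds to p = 2, and suitable values of p yield a close approximation to logistic regression.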
{"title":"Scalable Bayesian p-generalized probit and logistic regression","authors":"Zeyu Ding, Simon Omlor, Katja Ickstadt, Alexander Munteanu","doi":"10.1007/s11634-024-00599-1","DOIUrl":"https://doi.org/10.1007/s11634-024-00599-1","url":null,"abstract":"<p>The logit and probit link functions are arguably the two most common choices for binary regression models. Many studies have extended the choice of link functions to avoid possible misspecification and to improve the model fit to the data. We introduce the <i>p</i>-generalized Gaussian distribution (<i>p</i>-GGD) to binary regression in a Bayesian framework. The <i>p</i>-GGD has received considerable attention due to its flexibility in modeling the tails, while generalizing, for instance, over the standard normal distribution where <span>(p=2)</span> or the Laplace distribution where <span>(p=1)</span>. Here, we extend from maximum likelihood estimation (MLE) to Bayesian posterior estimation using Markov Chain Monte Carlo (MCMC) sampling for the model parameters <span>(beta)</span> and the link function parameter <i>p</i>. We use simulated and real-world data to verify the effect of different parameters <i>p</i> on the estimation results, and how logistic regression and probit regression can be incorporated into a broader framework. To make our Bayesian methods scalable in the case of large data, we also incorporate coresets to reduce the data before running the complex and time-consuming MCMC analysis. This allows us to perform very efficient calculations while retaining the original posterior parameter distributions up to little distortions both, in practice, and with theoretical guarantees.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"3 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141546666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-06-25  DOI: 10.1007/s11634-024-00598-2
Ornela Bregu, Nizar Bouguila
In this paper, we consider an alternative parametrization of the Dirichlet Compound Negative Multinomial (DCNM) distribution using rising polynomials. The new parametrization eliminates the Gamma functions and allows us to derive the exact Fisher Information Matrix, which brings significant improvements to model performance because feature correlations are taken into account. Second, we improve computational efficiency by approximating the DCNM model as a member of the exponential family of distributions, called EDCNM. The novel EDCNM model brings several advantages over the DCNM model, such as a closed-form solution for maximum likelihood estimation and higher efficiency on sparse datasets due to reduced computation time. Third, we implement Agglomerative Hierarchical clustering, in which the Kullback–Leibler divergence is derived and used to measure the distance between two EDCNM probability distributions. Finally, we integrate the Minimum Message Length criterion into our algorithm to estimate the optimal number of components of the mixture model. The merits of the proposed models are validated via challenging real-world applications in Natural Language Processing and Image/Video Recognition. Results reveal that the exponential approximation of the DCNM model significantly reduces the computational complexity in high-dimensional feature spaces.
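To make the clustering step concrete, here is a minimal sketch of average-linkage agglomerative clustering driven by a symmetrized Kullback–Leibler divergence; a generic KL between discrete distributions stands in for the paper’s closed-form EDCNM divergence, and all names here are ours.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def sym_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence between two discrete distributions."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def kl_agglomerative(dists, n_clusters):
    """Average-linkage agglomerative clustering of distributions under sym. KL."""
    n = len(dists)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = sym_kl(dists[i], dists[j])
    Z = linkage(squareform(D, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")

# Toy usage: three word-frequency profiles, two obvious groups.
profiles = [[8, 1, 1], [7, 2, 1], [1, 1, 8]]
print(kl_agglomerative(profiles, n_clusters=2))  # two groups, e.g. [1 1 2]
```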
{"title":"Dirichlet compound negative multinomial mixture models and applications","authors":"Ornela Bregu, Nizar Bouguila","doi":"10.1007/s11634-024-00598-2","DOIUrl":"https://doi.org/10.1007/s11634-024-00598-2","url":null,"abstract":"<p>In this paper, we consider an alternative parametrization of Dirichlet Compound Negative Multinomial (DCNM) using rising polynomials. The new parametrization gets rid of Gamma functions and allows us to derive the Exact Fisher Information Matrix, which brings significant improvements to model performance due to feature correlation consideration. Second, we propose to improve the computation efficiency by approximating the DCNM model as a member of the exponential family of distributions, called EDCNM. The novel EDCNM model brings several advantages as compared to the DCNM model, such as a closed-form solution for maximum likelihood estimation, higher efficiency due to computational time reduction for sparse datasets, etc. Third, we implement Agglomerative Hierarchical clustering, where Kullback–Leibler divergence is derived and used to measure the distance between two EDCNM probability distributions. Finally, we integrate the Minimum Message Length criterion in our algorithm to estimate the optimal number of components of the mixture model. The merits of our proposed models are validated via challenging real-world applications in Natural Language Processing and Image/Video Recognition. Results reveal that the exponential approximation of the DCNM model has reduced significantly the computational complexity in high-dimensional feature spaces.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"25 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141506099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-06-19  DOI: 10.1007/s11634-024-00596-4
Carlos Moreno-Pérez, Marco Minozzo
This paper investigates the reactions of US financial markets to press news from January 2019 to 1 May 2020. To this end, we deduce the content and uncertainty of the news by developing apposite indices from the headlines and snippets of The New York Times, using unsupervised machine learning techniques. In particular, we use Latent Dirichlet Allocation to infer the content (topics) of the articles, and Word Embedding (implemented with the Skip-gram model) and K-Means to measure their uncertainty. In this way, we arrive at a set of daily topic-specific uncertainty indices. These indices are then used to explain the behavior of the US financial markets by fitting a batch of EGARCH models. In essence, we find that two topic-specific uncertainty indices, one related to COVID-19 news and the other to trade-war news, explain the bulk of the movements in the financial markets from the beginning of 2019 to end-April 2020. Moreover, we find that the topic-specific uncertainty index related to the economy and the Federal Reserve is positively related to the financial markets, meaning that our index is able to capture the actions of the Federal Reserve during periods of uncertainty.
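A minimal sketch of the topic-extraction stage, using sklearn; the toy corpus, model sizes, and the final clustering step are our illustrative assumptions, not the paper’s exact configuration (which works on New York Times headlines and snippets and measures uncertainty via Skip-gram embeddings).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

headlines = [
    "Stocks fall as coronavirus fears spread",
    "Fed signals patience on interest rates",
    "Tariffs escalate in US-China trade war",
    "Markets rally on stimulus hopes",
]

# Document-term counts, then LDA to infer topic mixtures per headline.
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(headlines)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)          # rows: P(topic | document)

# The paper scores uncertainty via Word Embedding + K-Means; here we only
# hint at the K-Means step by clustering the topic mixtures themselves.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(doc_topics)
print(doc_topics.round(2), labels)
```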
{"title":"Natural language processing and financial markets: semi-supervised modelling of coronavirus and economic news","authors":"Carlos Moreno-Pérez, Marco Minozzo","doi":"10.1007/s11634-024-00596-4","DOIUrl":"https://doi.org/10.1007/s11634-024-00596-4","url":null,"abstract":"<p>This paper investigates the reactions of US financial markets to press news from January 2019 to 1 May 2020. To this end, we deduce the content and uncertainty of the news by developing apposite indices from the headlines and snippets of The New York Times, using unsupervised machine learning techniques. In particular, we use Latent Dirichlet Allocation to infer the content (topics) of the articles, and Word Embedding (implemented with the Skip-gram model) and K-Means to measure their uncertainty. In this way, we arrive at the definition of a set of daily topic-specific uncertainty indices. These indices are then used to find explanations for the behavior of the US financial markets by implementing a batch of EGARCH models. In substance, we find that two topic-specific uncertainty indices, one related to COVID-19 news and the other to trade war news, explain the bulk of the movements in the financial markets from the beginning of 2019 to end-April 2020. Moreover, we find that the topic-specific uncertainty index related to the economy and the Federal Reserve is positively related to the financial markets, meaning that our index is able to capture the actions of the Federal Reserve during periods of uncertainty.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"82 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141506100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-06-10  DOI: 10.1007/s11634-024-00597-3
Maurizio Vichi, Andrea Cerioli, Hans A. Kestler, Akinori Okada, Claus Weihs
{"title":"Editorial for ADAC issue 2 of volume 18 (2024)","authors":"Maurizio Vichi, Andrea Cerioli, Hans A. Kestler, Akinori Okada, Claus Weihs","doi":"10.1007/s11634-024-00597-3","DOIUrl":"10.1007/s11634-024-00597-3","url":null,"abstract":"","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"18 2","pages":"245 - 249"},"PeriodicalIF":1.4,"publicationDate":"2024-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141366538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}