Pub Date: 2024-03-12 DOI: 10.1007/s11634-024-00583-9
Dingge Liang, Marco Corneli, Charles Bouveyron, Pierre Latouche
With the significant increase in interactions between individuals through digital means, clustering the nodes of a graph has become a fundamental approach for analyzing large and complex networks. In this work, we propose the deep latent position model (DeepLPM), an end-to-end generative clustering approach that combines the widely used latent position model (LPM) for network analysis with a graph convolutional network encoding strategy. Moreover, an original estimation algorithm is introduced that integrates the explicit optimization of the posterior clustering probabilities via variational inference with the implicit optimization, by stochastic gradient descent, of the graph reconstruction. Numerical experiments on simulated scenarios highlight the ability of DeepLPM to self-penalize the evidence lower bound when selecting the number of clusters, and demonstrate its clustering capabilities relative to state-of-the-art methods. Finally, DeepLPM is applied to an ecclesiastical network in Merovingian Gaul and to the Cora citation network to illustrate its practical interest for exploring large and complex real-world networks.
Title: Clustering by deep latent position model with graph convolutional network. Journal: Advances in Data Analysis and Classification.
Pub Date: 2024-03-07 DOI: 10.1007/s11634-024-00582-w
Jianhua Zhao, Changchun Shang, Shulan Li, Ling Xin, Philip L. H. Yu
The Bayesian information criterion (BIC), defined as the observed data log likelihood minus a penalty term based on the sample size N, is a popular model selection criterion for factor analysis with complete data. This definition has also been suggested for incomplete data. However, the penalty term based on the 'complete' sample size N is the same whether the data are complete or incomplete. For incomplete data, there are often only N_i < N observations for variable i, which means that using the 'complete' sample size N implausibly ignores the amount of missing information inherent in incomplete data. Given this observation, a novel hierarchical BIC (HBIC) criterion, denoted HBICinc, is proposed for factor analysis with incomplete data. The novelty is that HBICinc uses only the actual amounts of observed information, namely the N_i, in the penalty term. Theoretically, it is shown that HBICinc is a large sample approximation of the variational Bayesian (VB) lower bound, and that BIC is a further approximation of HBICinc, which means that HBICinc shares the theoretical consistency of BIC. Experiments on synthetic and real data sets are conducted to assess the finite sample performance of HBICinc, BIC, and related criteria under various missing rates. The results show that HBICinc and BIC perform similarly when the missing rate is small, but HBICinc is more accurate when the missing rate is not small.
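The difference between a penalty based on N and one based on the per-variable counts N_i can be sketched in a few lines of Python. This is an illustration only: the abstract does not give the HBICinc formula, so `per_variable_penalty` below (a sum of (d_i / 2) log N_i terms) is an assumed form chosen to show the idea, and the parameter counts in `params_per_var` are hypothetical.

```python
import math

def observed_counts(rows):
    """Return N_i, the number of non-missing entries for each variable (column)."""
    n_vars = len(rows[0])
    return [sum(1 for row in rows if row[i] is not None) for i in range(n_vars)]

def bic_penalty(n_params, n):
    """Classic BIC penalty (d / 2) * log(N), using the 'complete' sample size N."""
    return 0.5 * n_params * math.log(n)

def per_variable_penalty(params_per_var, counts):
    """ASSUMED illustrative form: sum over variables of (d_i / 2) * log(N_i).
    Not taken from the paper; it only shows a penalty driven by the N_i."""
    return sum(0.5 * d * math.log(n_i) for d, n_i in zip(params_per_var, counts))

# Toy incomplete data: 4 subjects, 3 variables, None marks a missing value.
data = [
    [1.0, 2.0, None],
    [0.5, None, 1.5],
    [1.2, 2.2, 0.9],
    [0.8, None, None],
]
counts = observed_counts(data)   # [4, 2, 2]
params_per_var = [3, 3, 3]       # hypothetical number of parameters per variable

classic = bic_penalty(sum(params_per_var), len(data))
illustrative = per_variable_penalty(params_per_var, counts)
print(counts, round(classic, 3), round(illustrative, 3))
```

With missing values, every N_i is at most N, so a penalty built from the N_i can never exceed the N-based one for the same parameter counts.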
Title: Choosing the number of factors in factor analysis with incomplete data via a novel hierarchical Bayesian information criterion. Journal: Advances in Data Analysis and Classification.
Pub Date: 2024-03-06 DOI: 10.1007/s11634-024-00581-x
A. Martín Andrés, M. Álvarez Hernández
To measure the degree of agreement between R observers who independently classify n subjects into K categories, various kappa-type coefficients are often used. When R = 2, it is common to use Cohen's kappa, Scott's pi, Gwet's AC1/2, and Krippendorff's alpha coefficients (weighted or not). When R > 2, a pairwise version of these coefficients is normally used; in the same order as above: Hubert's kappa, Fleiss's kappa, Gwet's AC1/2, and Krippendorff's alpha. However, all these statistics are based on biased estimators of the expected index of agreements, since they estimate the product of two population proportions by the product of their sample estimators. This article has three aims. First, to provide statistics based on unbiased estimators of the expected index of agreements and to determine their variance from the variance of the original statistic. Second, to make pairwise extensions of some measures. Third, to show that the old and new estimators of Cohen's kappa and Hubert's kappa match the well-known estimators of the concordance and intraclass correlation coefficients when the former are defined with quadratic weights. The article shows that the new estimators are always greater than or equal to the classic ones, except for Gwet's coefficient, where the opposite holds, although these differences are only relevant for small sample sizes (e.g. n ≤ 30).
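As a point of reference, the classic (unweighted) Cohen's kappa for R = 2 can be sketched as follows. The function name and the toy ratings are illustrative assumptions; the code implements only the standard plug-in estimator, in which the expected agreement is the sum over categories of the product of the two raters' sample proportions (the quantity the abstract describes as biased). It does not reproduce the unbiased estimators proposed in the article.

```python
from collections import Counter

def cohen_kappa(r1, r2):
    """Classic unweighted Cohen's kappa for two raters.

    p_o is the observed proportion of agreement; p_e is the plug-in
    expected agreement, i.e. the sum over categories of the product of
    the two raters' sample proportions.
    Raises ZeroDivisionError when p_e == 1 (both raters always use the
    same single category), where kappa is undefined.
    """
    n = len(r1)
    categories = set(r1) | set(r2)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum((c1[k] / n) * (c2[k] / n) for k in categories)
    return (p_o - p_e) / (1 - p_e)

# Toy example: two raters, four subjects, two categories.
rater_1 = [1, 1, 0, 0]
rater_2 = [1, 0, 0, 0]
kappa = cohen_kappa(rater_1, rater_2)
print(kappa)  # 0.5: p_o = 0.75, p_e = 0.5*0.25 + 0.5*0.75 = 0.5
```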
Title: Estimators of various kappa coefficients based on the unbiased estimator of the expected index of agreements. Journal: Advances in Data Analysis and Classification.
Pub Date: 2024-02-27 DOI: 10.1007/s11634-024-00584-8
Luis-Angel García-Escudero, Salvatore Ingrassia, T. Brendan Murphy
Title: Special issue on “advances in models and learning for clustering and classification”. Journal: Advances in Data Analysis and Classification, volume 18 (1), pages 1-4.
Pub Date: 2024-02-22 DOI: 10.1007/s11634-024-00580-y
Carlo Gaetan, Paolo Girardi, Victor Muthama Musau
In the era of climate change, the distribution of climate variables evolves, with changes not limited to the mean value. Consequently, clustering algorithms based on central tendency could produce misleading results when used to summarize spatial and/or temporal patterns. We present a novel approach to spatial clustering of time series based on quantiles, using a Bayesian framework that incorporates a spatial dependence layer based on a Markov random field. We test the proposal in a series of simulations and then apply it to the sea surface temperature of the Mediterranean Sea, one of the first seas to be affected by climate change.
Title: Spatial quantile clustering of climate data. Journal: Advances in Data Analysis and Classification.
Pub Date: 2024-02-12 DOI: 10.1007/s11634-023-00577-z
Berkay Akturk, Ufuk Beyaztas, Han Lin Shang, Abhijit Mandal
Functional logistic regression is a popular model for capturing a linear relationship between a binary response and functional predictor variables. However, many methods used for parameter estimation in functional logistic regression are sensitive to outliers, which may lead to inaccurate parameter estimates and inferior classification accuracy. We propose a robust estimation procedure for functional logistic regression, in which the observations of the functional predictor are projected onto a set of finite-dimensional subspaces via robust functional principal component analysis. This dimension-reduction step reduces the effect of outliers in the functional predictor. The logistic regression coefficient is estimated using an M-type estimator based on the binary response and the robust principal component scores. In doing so, we provide robust estimates by minimizing the effects of outliers in the binary response and functional predictor variables. Via a series of Monte-Carlo simulations and using hand radiograph data, we examine the parameter estimation and classification accuracy for the response variable. We find that the robust procedure outperforms some existing robust and non-robust methods when outliers are present, while producing competitive results when outliers are absent. In addition, the proposed method is computationally more efficient than some existing robust alternatives.
Title: Robust functional logistic regression. Journal: Advances in Data Analysis and Classification.
Pub Date: 2024-02-07 DOI: 10.1007/s11634-024-00579-5
Kateřina Pawlasová, Iva Karafiátová, Jiří Dvořák
A spatial point pattern is a collection of points observed in a bounded region of the Euclidean plane or space. With the dynamic development of modern imaging methods, large datasets of point patterns are now available, representing, for example, sub-cellular location patterns for human proteins or large forest populations. The main goal of this paper is to show that the supervised multi-class classification task for this particular type of complex data can be solved via functional neural networks. To predict the class membership of a newly observed point pattern, we compute an empirical estimate of a selected functional characteristic. We then treat this estimated function as a functional variable entering the network. In a simulation study, we show that the neural network approach outperforms the kernel regression classifier, which we consider a benchmark method in the point pattern setting. We also analyse a real dataset of point patterns of intramembranous particles and illustrate the practical applicability of the proposed method.
Title: Neural networks with functional inputs for multi-class supervised classification of replicated point patterns. Journal: Advances in Data Analysis and Classification, volume 18 (3), pages 705-721. Open-access PDF: https://link.springer.com/content/pdf/10.1007/s11634-024-00579-5.pdf
Pub Date: 2024-01-31 DOI: 10.1007/s11634-023-00578-y
Yueqi Cao, Prudence Leung, Anthea Monod
Persistent homology is a methodology central to topological data analysis that extracts and summarizes the topological features within a dataset as a persistence diagram. It has recently gained much popularity from its myriad successful applications in many domains. However, its algebraic construction induces a metric space of persistence diagrams with a highly complex geometry. In this paper, we prove convergence of the k-means clustering algorithm on persistence diagram space and establish theoretical properties of the solution to the optimization problem in the Karush–Kuhn–Tucker framework. Additionally, we perform numerical experiments on both simulated and real data using various representations of persistent homology, including embeddings of persistence diagrams as well as the diagrams themselves and their generalizations as persistence measures. We find that k-means clustering applied directly to persistence diagrams and measures outperforms clustering of their vectorized representations.
Title: k-means clustering for persistent homology. Journal: Advances in Data Analysis and Classification.
Pub Date: 2024-01-17 DOI: 10.1007/s11634-023-00574-2
Paolo Giudici, Emanuela Raffinetti
A key step in assessing statistical forecasts is the evaluation of their predictive accuracy. Recently, a new measure called Rank Graduation Accuracy (RGA), based on the concordance between the ranks of the predicted values and the ranks of the actual values of a series of observations to be forecast, was proposed for assessing the quality of predictions. In this paper, we demonstrate that, from a classification perspective, when the response to be predicted is binary, the RGA coincides with both the AUROC and the Wilcoxon-Mann–Whitney statistic, and can be employed to evaluate the accuracy of probability forecasts. When the response to be predicted is real valued, the RGA can still be applied, unlike the AUROC, and in a similar way to measures such as the RMSE. Unlike the RMSE, the RGA evaluates point predictions in terms of their ranks rather than their values, which improves robustness.
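The well-known link between the AUROC and the Wilcoxon-Mann–Whitney statistic for a binary response can be checked numerically. The sketch below is an illustration under stated assumptions: it does not implement the RGA itself (its formula is not given in the abstract), the function names and toy data are hypothetical, and ties are handled with the usual half-credit and mid-rank conventions.

```python
def auroc_pairwise(y, scores):
    """AUROC as the share of (positive, negative) pairs ranked correctly,
    counting ties as one half."""
    pos = [s for yi, s in zip(y, scores) if yi == 1]
    neg = [s for yi, s in zip(y, scores) if yi == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def midranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def auroc_rank_sum(y, scores):
    """AUROC from the Mann-Whitney U statistic of the positive class:
    U = R1 - n1 (n1 + 1) / 2, then AUROC = U / (n1 * n0)."""
    ranks = midranks(scores)
    n1 = sum(y)
    n0 = len(y) - n1
    r1 = sum(r for yi, r in zip(y, ranks) if yi == 1)
    u = r1 - n1 * (n1 + 1) / 2
    return u / (n1 * n0)

# Toy binary response and probability forecasts (with ties).
y = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.4, 0.4]
a_pairs = auroc_pairwise(y, scores)
a_ranks = auroc_rank_sum(y, scores)
print(a_pairs, a_ranks)
```

Both functions depend on the scores only through their ordering, which is the property that makes rank-based accuracy measures insensitive to monotone transformations of the predictions.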
Title: RGA: a unified measure of predictive accuracy. Journal: Advances in Data Analysis and Classification.
Pub Date: 2023-12-18 DOI: 10.1007/s11634-023-00576-0
Hanning Chen, Qiang Zhao, Jingjing Wu
This paper addresses the two-class classification problem for data with rare and weak signals, under the modern high-dimensional setup (p ≫ n). Considering a two-component mixture of Gaussian features with different random mean vectors of rare and weak signals but a common covariance matrix (homoscedastic Gaussian), Fan (AS 41:2537-2571, 2013) investigated the optimality of linear discriminant analysis (LDA) and proposed an efficient variable selection and classification procedure. We extend their work to the more general scenario in which the two components have different random covariance matrices whose difference consists of rare and weak signals, in order to assess the effect of the difference in covariance matrices on classification. Under this model, we investigate the behaviour of the quadratic discriminant analysis (QDA) classifier. On the theoretical side, we derive the successful and unsuccessful classification regions of QDA. For data with rare signals, variable selection will mostly improve the performance of statistical procedures. On the implementation side, we therefore propose a variable selection procedure for QDA based on Higher Criticism Thresholding (HCT), which was shown to be efficient for LDA. In addition, we conduct extensive simulation studies to demonstrate the successful and unsuccessful classification regions of QDA and to evaluate the effectiveness of the proposed HCT-thresholded QDA.
Title: QDA classification of high-dimensional data with rare and weak signals. Journal: Advances in Data Analysis and Classification.