Advances in Data Analysis and Classification: Latest Articles

Clustering by deep latent position model with graph convolutional network

IF 1.4 | Zone 4, Computer Science | Q2, Statistics & Probability

Advances in Data Analysis and Classification

Pub Date : 2024-03-12 DOI: 10.1007/s11634-024-00583-9

Dingge Liang, Marco Corneli, Charles Bouveyron, Pierre Latouche

With the significant increase in interactions between individuals through digital means, clustering the nodes of a graph has become a fundamental approach for analyzing large and complex networks. In this work, we propose the deep latent position model (DeepLPM), an end-to-end generative clustering approach that combines the widely used latent position model (LPM) for network analysis with a graph convolutional network encoding strategy. Moreover, an original estimation algorithm is introduced that integrates the explicit optimization of the posterior clustering probabilities via variational inference with the implicit optimization, by stochastic gradient descent, of the graph reconstruction. Numerical experiments on simulated scenarios highlight the ability of DeepLPM to self-penalize the evidence lower bound when selecting the number of clusters, and demonstrate its clustering capabilities compared with state-of-the-art methods. Finally, DeepLPM is applied to an ecclesiastical network in Merovingian Gaul and to the Cora citation network to illustrate its practical interest for exploring large and complex real-world networks.
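As a rough illustration of the latent position idea mentioned in this abstract, the sketch below uses the classical LPM link function, in which the probability of an edge decreases with the distance between two nodes' latent positions. The function name, the intercept `alpha`, the toy positions, and the choice of a plain (unsquared) Euclidean distance are all assumptions made for illustration; the abstract does not give DeepLPM's exact decoder.

```python
import math

def edge_probability(z_i, z_j, alpha):
    """Classical LPM-style link probability: sigmoid(alpha - ||z_i - z_j||).

    Assumption: plain Euclidean distance and a scalar intercept alpha;
    the DeepLPM paper may parameterize this differently.
    """
    distance = math.dist(z_i, z_j)
    return 1.0 / (1.0 + math.exp(-(alpha - distance)))

# Hypothetical 2-D latent positions: two nearby nodes and one far away.
positions = {"a": (0.0, 0.0), "b": (0.5, 0.0), "c": (4.0, 3.0)}
alpha = 1.0  # hypothetical intercept controlling overall edge density

p_ab = edge_probability(positions["a"], positions["b"], alpha)
p_ac = edge_probability(positions["a"], positions["c"], alpha)
print(f"P(a~b) = {p_ab:.3f}, P(a~c) = {p_ac:.3f}")
```

Nodes that sit close together in the latent space are more likely to be connected, which is why clusters of latent positions can be read as clusters of nodes.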

Volume 19, issue 1, pp. 237-270.

Citations: 0

Choosing the number of factors in factor analysis with incomplete data via a novel hierarchical Bayesian information criterion

Pub Date : 2024-03-07 DOI: 10.1007/s11634-024-00582-w

Jianhua Zhao, Changchun Shang, Shulan Li, Ling Xin, Philip L. H. Yu

The Bayesian information criterion (BIC), defined as the observed data log likelihood minus a penalty term based on the sample size N, is a popular model selection criterion for factor analysis with complete data. This definition has also been suggested for incomplete data. However, the penalty term based on the ‘complete’ sample size N is the same whether the data are complete or incomplete. For incomplete data, there are often only N_i < N observations for variable i, which means that using the ‘complete’ sample size N implausibly ignores the amount of missing information inherent in incomplete data. Given this observation, a novel hierarchical BIC (HBIC) criterion, denoted HBIC_inc, is proposed for factor analysis with incomplete data. The novelty is that HBIC_inc uses only the actual amounts of observed information, namely the N_i's, in the penalty term. Theoretically, it is shown that HBIC_inc is a large-sample approximation of the variational Bayesian (VB) lower bound and that BIC is a further approximation of HBIC_inc, which means that HBIC_inc shares the theoretical consistency of BIC. Experiments on synthetic and real data sets are conducted to assess the finite-sample performance of HBIC_inc, BIC, and related criteria under various missing rates. The results show that HBIC_inc and BIC perform similarly when the missing rate is small, but HBIC_inc is more accurate when the missing rate is not small.
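The contrast between a single penalty based on N and a penalty based on the per-variable counts N_i can be sketched as follows. This is only an illustration under stated assumptions: the classical BIC is written in the "log likelihood minus penalty" form used in the abstract, and the per-variable variant simply penalizes each variable's parameters by log N_i. The exact form of HBIC_inc is not given in the abstract, so the second function, its name, and the toy numbers are hypothetical.

```python
import math

def bic(loglik, n_params, n):
    """Classical BIC in 'log likelihood minus penalty' form (larger is better)."""
    return loglik - 0.5 * n_params * math.log(n)

def per_variable_penalized(loglik, params_per_variable, n_observed):
    """Illustrative variant (assumption, not the paper's formula):
    the parameters tied to variable i are penalized by log N_i."""
    penalty = sum(
        0.5 * d_i * math.log(n_i)
        for d_i, n_i in zip(params_per_variable, n_observed)
    )
    return loglik - penalty

# Hypothetical fit: 3 variables, 3 parameters each, N = 200 rows.
loglik = -500.0
params_per_variable = [3, 3, 3]

bic_value = bic(loglik, sum(params_per_variable), 200)
complete_case = per_variable_penalized(loglik, params_per_variable, [200, 200, 200])
incomplete_case = per_variable_penalized(loglik, params_per_variable, [200, 150, 80])
print(bic_value, complete_case, incomplete_case)
```

With complete data the two criteria coincide; once some variables have fewer observations, the per-variable penalty is smaller than the one based on N.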

Volume 19, issue 1, pp. 209-235. PDF: https://link.springer.com/content/pdf/10.1007/s11634-024-00582-w.pdf

Citations: 0

Estimators of various kappa coefficients based on the unbiased estimator of the expected index of agreements

Pub Date : 2024-03-06 DOI: 10.1007/s11634-024-00581-x

A. Martín Andrés, M. Álvarez Hernández

To measure the degree of agreement between R observers who independently classify n subjects into K categories, various kappa-type coefficients are often used. When R = 2, it is common to use Cohen's kappa, Scott's pi, Gwet's AC1/2, and Krippendorff's alpha coefficients (weighted or not). When R > 2, a pairwise version of one of these coefficients is normally used; in the same order as above: Hubert's kappa, Fleiss's kappa, Gwet's AC1/2, and Krippendorff's alpha. However, all these statistics are based on biased estimators of the expected index of agreements, since they estimate the product of two population proportions by the product of their sample estimators. This article has three aims. First, to provide statistics based on unbiased estimators of the expected index of agreements and to determine their variance from the variance of the original statistic. Second, to make pairwise extensions of some measures. Third, to show that the old and new estimators of Cohen's kappa and Hubert's kappa coefficients match the well-known estimators of the concordance and intraclass correlation coefficients when the former are defined with quadratic weights. The article shows that the new estimators are always greater than or equal to the classic ones, except in the case of Gwet's coefficient, where it is the other way around, although these differences are only relevant for small sample sizes (e.g. n ≤ 30).
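For reference, the classic (unweighted) Cohen's kappa for R = 2 can be computed as in the minimal sketch below. The expected agreement here is the plug-in product of sample proportions, which is the kind of estimator the abstract describes as biased; the article's unbiased correction is not reproduced, because its formula is not given in the abstract. The function name and the toy ratings are illustrative.

```python
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Classic unweighted Cohen's kappa for two raters."""
    n = len(ratings_a)
    # Observed index of agreement: share of subjects rated identically.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected index of agreement: sum over categories of the product
    # of the two raters' sample proportions (plug-in estimate).
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (observed - expected) / (1.0 - expected)

# Hypothetical ratings of 10 subjects into categories "x" and "y".
rater_a = ["x"] * 6 + ["y"] * 4
rater_b = ["x", "x", "x", "x", "y", "y", "y", "y", "y", "x"]
kappa = cohen_kappa(rater_a, rater_b)
print(round(kappa, 3))
```

In this toy data the raters agree on 7 of 10 subjects and the plug-in expected agreement is 0.5, so kappa is (0.7 - 0.5) / (1 - 0.5) = 0.4.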

Volume 19, issue 1, pp. 177-207. PDF: https://link.springer.com/content/pdf/10.1007/s11634-024-00581-x.pdf

Citations: 0

Special issue on “advances in models and learning for clustering and classification”

Pub Date : 2024-02-27 DOI: 10.1007/s11634-024-00584-8

Luis-Angel García-Escudero, Salvatore Ingrassia, T. Brendan Murphy

Citations: 0

Spatial quantile clustering of climate data

Pub Date : 2024-02-22 DOI: 10.1007/s11634-024-00580-y

Carlo Gaetan, Paolo Girardi, Victor Muthama Musau

In the era of climate change, the distribution of climate variables evolves, with changes not limited to the mean value. Consequently, clustering algorithms based on central tendency could produce misleading results when used to summarize spatial and/or temporal patterns. We present a novel approach to the spatial clustering of time series based on quantiles, using a Bayesian framework that incorporates a spatial dependence layer based on a Markov random field. The proposal is tested in a series of simulations and then applied to the sea surface temperature of the Mediterranean Sea, one of the first seas to be affected by climate change.
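The motivation in this abstract, that two series can share a mean yet differ in their quantiles, can be shown with a small sketch. The two temperature-like series below are invented for illustration, and the helper name is hypothetical; the paper's Bayesian model is not reproduced here.

```python
import statistics

def upper_quantile(series, n=10):
    """Last cut point returned by statistics.quantiles (the 90% quantile for n=10)."""
    return statistics.quantiles(series, n=n)[-1]

# Hypothetical series with (approximately) the same mean of 20.
stable = [20.0, 20.5, 19.5, 20.0, 20.2, 19.8, 20.0, 20.1, 19.9, 20.0]
heavy_tail = [19.0, 19.2, 19.1, 19.0, 19.3, 19.1, 19.2, 19.0, 19.1, 28.0]

mean_gap = abs(statistics.mean(stable) - statistics.mean(heavy_tail))
q_stable = upper_quantile(stable)
q_heavy = upper_quantile(heavy_tail)
print(f"mean gap = {mean_gap:.6f}, q90: {q_stable:.2f} vs {q_heavy:.2f}")
```

A method that compares only means would treat these two series as alike, whereas their upper quantiles differ markedly.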

Citations: 0

Robust functional logistic regression

Pub Date : 2024-02-12 DOI: 10.1007/s11634-023-00577-z

Berkay Akturk, Ufuk Beyaztas, Han Lin Shang, Abhijit Mandal

Functional logistic regression is a popular model to capture a linear relationship between binary response and functional predictor variables. However, many methods used for parameter estimation in functional logistic regression are sensitive to outliers, which may lead to inaccurate parameter estimates and inferior classification accuracy. We propose a robust estimation procedure for functional logistic regression, in which the observations of the functional predictor are projected onto a set of finite-dimensional subspaces via robust functional principal component analysis. This dimension-reduction step reduces the outlying effects in the functional predictor. The logistic regression coefficient is estimated using an M-type estimator based on binary response and robust principal component scores. In doing so, we provide robust estimates by minimizing the effects of outliers in the binary response and functional predictor variables. Via a series of Monte-Carlo simulations and using hand radiograph data, we examine the parameter estimation and classification accuracy for the response variable. We find that the robust procedure outperforms some existing robust and non-robust methods when outliers are present, while producing competitive results when outliers are absent. In addition, the proposed method is computationally more efficient than some existing robust alternatives.

Volume 19, issue 1, pp. 121-145. PDF: https://link.springer.com/content/pdf/10.1007/s11634-023-00577-z.pdf

Citations: 0

Neural networks with functional inputs for multi-class supervised classification of replicated point patterns

Pub Date : 2024-02-07 DOI: 10.1007/s11634-024-00579-5

Kateřina Pawlasová, Iva Karafiátová, Jiří Dvořák

A spatial point pattern is a collection of points observed in a bounded region of the Euclidean plane or space. With the dynamic development of modern imaging methods, large datasets of point patterns are now available, representing, for example, sub-cellular location patterns for human proteins or large forest populations. The main goal of this paper is to show that the supervised multi-class classification task for this particular type of complex data can be solved via functional neural networks. To predict the class membership of a newly observed point pattern, we compute an empirical estimate of a selected functional characteristic. We then treat this estimated function as a functional variable entering the network. In a simulation study, we show that the neural network approach outperforms the kernel regression classifier, which we consider a benchmark method in the point pattern setting. We also analyse a real dataset of point patterns of intramembranous particles and illustrate the practical applicability of the proposed method.
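To make "an empirical estimate of a selected functional characteristic" concrete, the sketch below evaluates a naive estimate of Ripley's K function on a grid of distances, producing the kind of curve that could be fed to a network as a functional input. The choice of the K function, the absence of edge correction, the function names, and the toy patterns are assumptions made for illustration; the abstract does not say which characteristic or estimator the authors use.

```python
import math

def ripley_k(points, r, area):
    """Naive estimate of Ripley's K at distance r (no edge correction)."""
    n = len(points)
    # Count ordered pairs of distinct points lying within distance r.
    close_pairs = sum(
        1
        for i, p in enumerate(points)
        for j, q in enumerate(points)
        if i != j and math.dist(p, q) <= r
    )
    return area * close_pairs / (n * (n - 1))

def functional_input(points, radii, area):
    """Discretized K curve: one value per radius."""
    return [ripley_k(points, r, area) for r in radii]

# Hypothetical patterns in the unit square: two tight clusters vs. a regular grid.
clustered = [(0.10, 0.10), (0.12, 0.10), (0.10, 0.12),
             (0.90, 0.90), (0.92, 0.90), (0.90, 0.92)]
regular = [(x, y) for x in (1 / 6, 0.5, 5 / 6) for y in (0.25, 0.75)]
radii = [0.05, 0.2, 0.5]

k_clustered = functional_input(clustered, radii, area=1.0)
k_regular = functional_input(regular, radii, area=1.0)
print(k_clustered, k_regular)
```

At small distances the clustered pattern has a much larger K value than the regular one, so the curves carry the information a classifier needs to separate the two kinds of pattern.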

Volume 18, issue 3, pp. 705-721. PDF: https://link.springer.com/content/pdf/10.1007/s11634-024-00579-5.pdf

Citations: 0

k-means clustering for persistent homology

Pub Date : 2024-01-31 DOI: 10.1007/s11634-023-00578-y

Yueqi Cao, Prudence Leung, Anthea Monod

Persistent homology is a methodology central to topological data analysis that extracts and summarizes the topological features within a dataset as a persistence diagram. It has recently gained much popularity from its myriad successful applications in many domains; however, its algebraic construction induces a metric space of persistence diagrams with a highly complex geometry. In this paper, we prove convergence of the k-means clustering algorithm on persistence diagram space and establish theoretical properties of the solution to the optimization problem in the Karush–Kuhn–Tucker framework. Additionally, we perform numerical experiments on both simulated and real data using various representations of persistent homology, including embeddings of persistence diagrams as well as the diagrams themselves and their generalizations as persistence measures. We find that k-means clustering applied directly to persistence diagrams and persistence measures outperforms clustering of their vectorized representations.

Citations: 0

RGA: a unified measure of predictive accuracy

Pub Date : 2024-01-17 DOI: 10.1007/s11634-023-00574-2

Paolo Giudici, Emanuela Raffinetti

A key step in assessing statistical forecasts is evaluating their predictive accuracy. Recently, a new measure called Rank Graduation Accuracy (RGA) was proposed for assessing the quality of predictions; it is based on the concordance between the ranks of the predicted values and the ranks of the actual values of a series of observations to be forecast. In this paper, we demonstrate that, from a classification perspective, when the response to be predicted is binary, the RGA coincides with both the AUROC and the Wilcoxon-Mann–Whitney statistic, and can be employed to evaluate the accuracy of probability forecasts. When the response to be predicted is real valued, the RGA can still be applied, unlike the AUROC and similarly to measures such as the RMSE. Unlike the RMSE, the RGA evaluates point predictions in terms of their ranks rather than their values, which improves robustness.
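The link between the AUROC and the Wilcoxon-Mann-Whitney statistic that this abstract relies on can be checked numerically. The sketch below computes the AUROC once as the share of correctly ordered (positive, negative) pairs and once from the rank-sum formula with midranks for ties. It does not implement the RGA itself, whose definition is not given in the abstract; the function names and the toy labels and scores are illustrative.

```python
def auroc_pairwise(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def auroc_rank_sum(labels, scores):
    """Same quantity from the Wilcoxon-Mann-Whitney rank-sum formula."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the block of tied scores starting at sorted position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for y, r in zip(labels, ranks) if y == 1)
    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_statistic / (n_pos * n_neg)

# Hypothetical binary outcomes and probability forecasts (with one tie).
labels = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.4, 0.6]

auc_pairs = auroc_pairwise(labels, scores)
auc_ranks = auroc_rank_sum(labels, scores)
print(auc_pairs, auc_ranks)
```

Both routes depend only on the ordering of the scores, not on their magnitudes, which is the rank-based property the abstract contrasts with the RMSE.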

Citations: 0

QDA classification of high-dimensional data with rare and weak signals

Pub Date : 2023-12-18 DOI: 10.1007/s11634-023-00576-0

Hanning Chen, Qiang Zhao, Jingjing Wu

This paper addresses the two-class classification problem for data with rare and weak signals under the modern high-dimensional setup (p ≫ n). Considering a two-component mixture of Gaussian features with different random mean vectors of rare and weak signals but a common covariance matrix (homoscedastic Gaussian), Fan (AS 41:2537-2571, 2013) investigated the optimality of linear discriminant analysis (LDA) and proposed an efficient variable selection and classification procedure. We extend their work to the more general scenario in which the two components have different random covariance matrices whose differences are also rare and weak signals, in order to assess the effect of a difference in covariance matrices on classification. Under this model, we investigate the behaviour of the quadratic discriminant analysis (QDA) classifier. On the theoretical side, we derive the successful and unsuccessful classification regions of QDA. For data with rare signals, variable selection will generally improve the performance of statistical procedures. On the implementation side, we therefore propose a variable selection procedure for QDA based on Higher Criticism Thresholding (HCT), which was proved efficient for LDA. In addition, we conduct extensive simulation studies to demonstrate the successful and unsuccessful classification regions of QDA and to evaluate the effectiveness of the proposed HCT-thresholded QDA.
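The role of a covariance difference in QDA can be illustrated with a deliberately simplified sketch: two classes with the same mean but different spreads, classified by comparing Gaussian log-densities. For brevity the covariance is assumed diagonal and is estimated per feature; this simplification, the function names, and the toy samples are assumptions for illustration and do not reproduce the paper's model or its HCT variable selection.

```python
import math

def fit_class(samples):
    """Per-feature mean and variance for one class (diagonal covariance assumed)."""
    n = len(samples)
    dims = range(len(samples[0]))
    means = [sum(x[d] for x in samples) / n for d in dims]
    variances = [
        sum((x[d] - means[d]) ** 2 for x in samples) / (n - 1) for d in dims
    ]
    return means, variances

def qda_score(x, means, variances, prior):
    """Gaussian log-density up to an additive constant, plus the log prior."""
    return math.log(prior) - 0.5 * sum(
        math.log(v) + (xi - m) ** 2 / v for xi, m, v in zip(x, means, variances)
    )

def classify(x, params, priors):
    """Index of the class with the largest quadratic discriminant score."""
    scores = [qda_score(x, m, v, p) for (m, v), p in zip(params, priors)]
    return scores.index(max(scores))

# Hypothetical training data: both classes are centred at the origin,
# so only the spread distinguishes them.
class_0 = [(-0.1, 0.1), (0.1, -0.1), (0.0, 0.2), (0.0, -0.2)]
class_1 = [(-2.0, 1.0), (2.0, -1.0), (1.0, 2.0), (-1.0, -2.0)]
params = [fit_class(class_0), fit_class(class_1)]
priors = [0.5, 0.5]

near_origin = classify((0.05, 0.0), params, priors)
far_away = classify((1.5, 1.5), params, priors)
print(near_origin, far_away)
```

Because the class means coincide, a rule based on a common covariance matrix has no signal to work with here, while the quadratic rule separates the classes using the variance difference alone.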

Volume 19, issue 1, pp. 31-65. PDF: https://link.springer.com/content/pdf/10.1007/s11634-023-00576-0.pdf

Citations: 0
