Pub Date: 2024-06-01 | DOI: 10.1016/j.jmva.2024.105338
Yasuhito Tsuruta
Kernel density estimation with spherical data can flexibly estimate the shape of an underlying density, including rotationally symmetric, skewed, and multimodal distributions. Standard estimators are generally based on rotationally symmetric kernel functions such as the von Mises kernel function. Unfortunately, their mean integrated squared error does not have root-n consistency, and increasing the dimension slows its convergence rate. Therefore, this study aims to improve accuracy by correcting this bias. It proposes bias correction methods by applying the generalized jackknifing method, which can be generated from the von Mises kernel function. We also obtain the asymptotic mean integrated squared errors of the proposed estimators. We find that the convergence rates of the proposed estimators are higher than those of previous estimators. Further, a numerical experiment shows that the proposed estimators perform better than the von Mises kernel density estimators in finite samples in scenarios that are mixtures of von Mises densities.
{"title":"Bias correction for kernel density estimation with spherical data","authors":"Yasuhito Tsuruta","doi":"10.1016/j.jmva.2024.105338","DOIUrl":"10.1016/j.jmva.2024.105338","url":null,"abstract":"<div><p>Kernel density estimations with spherical data can flexibly estimate the shape of an underlying density, including rotationally symmetric, skewed, and multimodal distributions. Standard estimators are generally based on rotationally symmetric kernel functions such as the von Mises kernel function. Unfortunately, their mean integrated squared error does not have root-<span><math><mi>n</mi></math></span> consistency and increasing the dimension slows its convergence rate. Therefore, this study aims to improve its accuracy by correcting this bias. It proposes bias correction methods by applying the generalized jackknifing method that can be generated from the von Mises kernel function. We also obtain the asymptotic mean integrated squared errors of the proposed estimators. We find that the convergence rates of the proposed estimators are higher than those of previous estimators. Further, a numerical experiment shows that the proposed estimators perform better than the von Mises kernel density estimators in finite samples in scenarios that are mixtures of von Mises densities.</p></div>","PeriodicalId":16431,"journal":{"name":"Journal of Multivariate Analysis","volume":"203 ","pages":"Article 105338"},"PeriodicalIF":1.6,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141281345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-05-28 | DOI: 10.1016/j.jmva.2024.105332
Sinda Ammous , Jérôme Dedecker , Céline Duval
A new multivariate density estimator for stationary sequences is obtained by Fourier inversion of the thresholded empirical characteristic function. This estimator does not depend on the choice of parameters related to the smoothness of the density; it is directly adaptive. We establish oracle inequalities valid for independent, α-mixing and τ-mixing sequences, which allows us to derive optimal convergence rates, up to a logarithmic loss. On general anisotropic Sobolev classes, the estimator adapts to the regularity of the unknown density but also achieves directional adaptivity. More precisely, the estimator is able to reach the convergence rate induced by the best Sobolev regularity of the density of AX, where A belongs to a class of invertible matrices describing all the possible directions. The estimator is easy to implement and numerically efficient. It depends on the calibration of a parameter for which we propose an innovative numerical selection procedure, using the Euler characteristic of the thresholded areas.
{"title":"Adaptive directional estimator of the density in Rd for independent and mixing sequences","authors":"Sinda Ammous , Jérôme Dedecker , Céline Duval","doi":"10.1016/j.jmva.2024.105332","DOIUrl":"https://doi.org/10.1016/j.jmva.2024.105332","url":null,"abstract":"<div><p>A new multivariate density estimator for stationary sequences is obtained by Fourier inversion of the thresholded empirical characteristic function. This estimator does not depend on the choice of parameters related to the smoothness of the density; it is directly adaptive. We establish oracle inequalities valid for independent, <span><math><mi>α</mi></math></span>-mixing and <span><math><mi>τ</mi></math></span>-mixing sequences, which allows us to derive optimal convergence rates, up to a logarithmic loss. On general anisotropic Sobolev classes, the estimator adapts to the regularity of the unknown density but also achieves directional adaptivity. More precisely, the estimator is able to reach the convergence rate induced by the <em>best</em> Sobolev regularity of the density of <span><math><mrow><mi>A</mi><mi>X</mi></mrow></math></span>, where <span><math><mi>A</mi></math></span> belongs to a class of invertible matrices describing all the possible directions. The estimator is easy to implement and numerically efficient. 
It depends on the calibration of a parameter for which we propose an innovative numerical selection procedure, using the Euler characteristic of the thresholded areas.</p></div>","PeriodicalId":16431,"journal":{"name":"Journal of Multivariate Analysis","volume":"203 ","pages":"Article 105332"},"PeriodicalIF":1.6,"publicationDate":"2024-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141290044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
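A one-dimensional sketch conveys the general idea of Fourier inversion of a thresholded empirical characteristic function. The threshold level sqrt(log n / n) and the integration grids here are illustrative choices, not the authors' adaptive multivariate procedure.

```python
import numpy as np

def ecf_density(x_grid, data, t_max=20.0, n_t=1001):
    """Density estimate by Fourier inversion of the thresholded
    empirical characteristic function (1-d illustration)."""
    n = len(data)
    t = np.linspace(-t_max, t_max, n_t)
    # empirical characteristic function on the frequency grid
    ecf = np.exp(1j * t[:, None] * data[None, :]).mean(axis=1)
    # zero out frequencies below the noise level ~ sqrt(log n / n)
    thresh = np.sqrt(np.log(n) / n)
    ecf = np.where(np.abs(ecf) >= thresh, ecf, 0.0)
    # inversion: f(x) = (1/2pi) * integral of exp(-itx) * phi(t) dt
    integrand = np.exp(-1j * np.outer(x_grid, t)) * ecf[None, :]
    return np.real(np.trapz(integrand, t, axis=1)) / (2.0 * np.pi)

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, 2000)
xs = np.linspace(-4.0, 4.0, 81)
fhat = ecf_density(xs, data)  # fhat[40] estimates the N(0,1) density at 0
```

No smoothness parameter is tuned; the thresholding alone decides which frequencies survive, which is the sense in which such estimators are "directly adaptive".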
Pub Date: 2024-05-28 | DOI: 10.1016/j.jmva.2024.105337
Angelika Silbernagel, Alexander Schnurr
Ordinal pattern dependence has been introduced in order to capture co-monotonic behavior between two time series. This concept has several features one would intuitively demand from a dependence measure. It was believed that ordinal pattern dependence satisfies the axioms which Grothe et al. (2014) proclaimed for a multivariate measure of dependence. In the present article we show that this is not true and that there is a mistake in the article by Betken et al. (2021). Furthermore, we show that ordinal pattern dependence satisfies a slightly modified set of axioms.
{"title":"Ordinal pattern dependence and multivariate measures of dependence","authors":"Angelika Silbernagel, Alexander Schnurr","doi":"10.1016/j.jmva.2024.105337","DOIUrl":"https://doi.org/10.1016/j.jmva.2024.105337","url":null,"abstract":"<div><p>Ordinal pattern dependence has been introduced in order to capture co-monotonic behavior between two time series. This concept has several features one would intuitively demand from a dependence measure. It was believed that ordinal pattern dependence satisfies the axioms which Grothe et al. (2014) proclaimed for a multivariate measure of dependence. In the present article we show that this is not true and that there is a mistake in the article by Betken et al. (2021). Furthermore, we show that ordinal pattern dependence satisfies a slightly modified set of axioms.</p></div>","PeriodicalId":16431,"journal":{"name":"Journal of Multivariate Analysis","volume":"203 ","pages":"Article 105337"},"PeriodicalIF":1.6,"publicationDate":"2024-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0047259X24000447/pdfft?md5=1cb9743828786dd1e4dbfb081a6f213d&pid=1-s2.0-S0047259X24000447-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141323996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-05-24 | DOI: 10.1016/j.jmva.2024.105336
Steven De Keyser, Irène Gijbels
This article proposes copula-based dependence quantification between multiple groups of random variables of possibly different sizes via the family of Φ-divergences. An axiomatic framework for this purpose is provided, after which we focus on the absolutely continuous setting, assuming copula densities exist. We consider parametric and semi-parametric frameworks, discuss estimation procedures, and report on asymptotic properties of the proposed estimators. In particular, we first concentrate on a Gaussian copula approach yielding explicit and attractive dependence coefficients for specific choices of Φ, which are more amenable for estimation. Next, general parametric copula families are considered, with special attention to nested Archimedean copulas, being a natural choice for dependence modelling of random vectors. The results are illustrated by means of examples. Simulations and a real-world application on financial data are provided as well.
{"title":"Parametric dependence between random vectors via copula-based divergence measures","authors":"Steven De Keyser, Irène Gijbels","doi":"10.1016/j.jmva.2024.105336","DOIUrl":"https://doi.org/10.1016/j.jmva.2024.105336","url":null,"abstract":"<div><p>This article proposes copula-based dependence quantification between multiple groups of random variables of possibly different sizes via the family of <span><math><mi>Φ</mi></math></span>-divergences. An axiomatic framework for this purpose is provided, after which we focus on the absolutely continuous setting assuming copula densities exist. We consider parametric and semi-parametric frameworks, discuss estimation procedures, and report on asymptotic properties of the proposed estimators. In particular, we first concentrate on a Gaussian copula approach yielding explicit and attractive dependence coefficients for specific choices of <span><math><mi>Φ</mi></math></span>, which are more amenable for estimation. Next, general parametric copula families are considered, with special attention to nested Archimedean copulas, being a natural choice for dependence modelling of random vectors. The results are illustrated by means of examples. Simulations and a real-world application on financial data are provided as well.</p></div>","PeriodicalId":16431,"journal":{"name":"Journal of Multivariate Analysis","volume":"203 ","pages":"Article 105336"},"PeriodicalIF":1.6,"publicationDate":"2024-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141239837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-05-23 | DOI: 10.1016/j.jmva.2024.105335
Tianyu Liu , Somabha Mukherjee , Rahul Biswas
The k-tensor Ising model is a multivariate exponential family on a p-dimensional binary hypercube for modeling dependent binary data, where the sufficient statistic consists of all k-fold products of the observations, and the parameter is an unknown k-fold tensor, designed to capture higher-order interactions between the binary variables. In this paper, we describe an approach based on a penalization technique that helps us recover the signed support of the tensor parameter with high probability, assuming that no entry of the true tensor is too close to zero. The method is based on an ℓ1-regularized node-wise logistic regression, that recovers the signed neighborhood of each node with high probability. Our analysis is carried out in the high-dimensional regime, that allows the dimension p of the Ising model, as well as the interaction factor k, to potentially grow to ∞ with the sample size n. We show that if the minimum interaction strength is not too small, then consistent recovery of the entire signed support is possible if one takes n = Ω((k!)^8 d^3 log (p−1 choose k−1)) samples, where d denotes the maximum degree of the hypernetwork in question. Our results are validated in two simulation settings, and applied on a real neurobiological dataset consisting of multi-array electro-physiological recordings from the mouse visual cortex, to model higher-order interactions between the brain regions.
{"title":"Tensor recovery in high-dimensional Ising models","authors":"Tianyu Liu , Somabha Mukherjee , Rahul Biswas","doi":"10.1016/j.jmva.2024.105335","DOIUrl":"https://doi.org/10.1016/j.jmva.2024.105335","url":null,"abstract":"<div><p>The <span><math><mi>k</mi></math></span>-tensor Ising model is a multivariate exponential family on a <span><math><mi>p</mi></math></span>-dimensional binary hypercube for modeling dependent binary data, where the sufficient statistic consists of all <span><math><mi>k</mi></math></span>-fold products of the observations, and the parameter is an unknown <span><math><mi>k</mi></math></span>-fold tensor, designed to capture higher-order interactions between the binary variables. In this paper, we describe an approach based on a penalization technique that helps us recover the signed support of the tensor parameter with high probability, assuming that no entry of the true tensor is too close to zero. The method is based on an <span><math><msub><mrow><mi>ℓ</mi></mrow><mrow><mn>1</mn></mrow></msub></math></span>-regularized node-wise logistic regression, that recovers the signed neighborhood of each node with high probability. Our analysis is carried out in the high-dimensional regime, that allows the dimension <span><math><mi>p</mi></math></span> of the Ising model, as well as the interaction factor <span><math><mi>k</mi></math></span> to potentially grow to <span><math><mi>∞</mi></math></span> with the sample size <span><math><mi>n</mi></math></span>. 
We show that if the minimum interaction strength is not too small, then consistent recovery of the entire signed support is possible if one takes <span><math><mrow><mi>n</mi><mo>=</mo><mi>Ω</mi><mrow><mo>(</mo><msup><mrow><mrow><mo>(</mo><mi>k</mi><mo>!</mo><mo>)</mo></mrow></mrow><mrow><mn>8</mn></mrow></msup><msup><mrow><mi>d</mi></mrow><mrow><mn>3</mn></mrow></msup><mo>log</mo><mfenced><mrow><mfrac><mrow><mi>p</mi><mo>−</mo><mn>1</mn></mrow><mrow><mi>k</mi><mo>−</mo><mn>1</mn></mrow></mfrac></mrow></mfenced><mo>)</mo></mrow></mrow></math></span> samples, where <span><math><mi>d</mi></math></span> denotes the maximum degree of the hypernetwork in question. Our results are validated in two simulation settings, and applied on a real neurobiological dataset consisting of multi-array electro-physiological recordings from the mouse visual cortex, to model higher-order interactions between the brain regions.</p></div>","PeriodicalId":16431,"journal":{"name":"Journal of Multivariate Analysis","volume":"203 ","pages":"Article 105335"},"PeriodicalIF":1.6,"publicationDate":"2024-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141164340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-05-22 | DOI: 10.1016/j.jmva.2024.105334
Ryo Okano , Masaaki Imaizumi
Distribution data refer to a data set in which each sample is represented as a probability distribution, a subject area that has received increasing interest in the field of statistics. Although several studies have developed distribution-to-distribution regression models for univariate variables, the multivariate scenario remains under-explored due to technical complexities. In this study, we introduce models for regression from one Gaussian distribution to another, using the Wasserstein metric. These models are constructed using the geometry of the Wasserstein space, which enables the transformation of Gaussian distributions into components of a linear matrix space. Owing to their linear regression frameworks, our models are intuitively understandable, and their implementation is simplified because of the optimal transport problem’s analytical solution between Gaussian distributions. We also explore a generalization of our models to encompass non-Gaussian scenarios. We establish the convergence rates of in-sample prediction errors for the empirical risk minimizations in our models. In comparative simulation experiments, our models demonstrate superior performance over a simpler alternative method that transforms Gaussian distributions into matrices. We present an application of our methodology using weather data for illustration purposes.
{"title":"Distribution-on-distribution regression with Wasserstein metric: Multivariate Gaussian case","authors":"Ryo Okano , Masaaki Imaizumi","doi":"10.1016/j.jmva.2024.105334","DOIUrl":"https://doi.org/10.1016/j.jmva.2024.105334","url":null,"abstract":"<div><p>Distribution data refer to a data set in which each sample is represented as a probability distribution, a subject area that has received increasing interest in the field of statistics. Although several studies have developed distribution-to-distribution regression models for univariate variables, the multivariate scenario remains under-explored due to technical complexities. In this study, we introduce models for regression from one Gaussian distribution to another, using the Wasserstein metric. These models are constructed using the geometry of the Wasserstein space, which enables the transformation of Gaussian distributions into components of a linear matrix space. Owing to their linear regression frameworks, our models are intuitively understandable, and their implementation is simplified because of the optimal transport problem’s analytical solution between Gaussian distributions. We also explore a generalization of our models to encompass non-Gaussian scenarios. We establish the convergence rates of in-sample prediction errors for the empirical risk minimizations in our models. In comparative simulation experiments, our models demonstrate superior performance over a simpler alternative method that transforms Gaussian distributions into matrices. 
We present an application of our methodology using weather data for illustration purposes.</p></div>","PeriodicalId":16431,"journal":{"name":"Journal of Multivariate Analysis","volume":"203 ","pages":"Article 105334"},"PeriodicalIF":1.6,"publicationDate":"2024-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0047259X24000411/pdfft?md5=dea43975f3758fd74adfc88e822be366&pid=1-s2.0-S0047259X24000411-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141239836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
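The analytical tractability mentioned above rests on the closed-form 2-Wasserstein distance between Gaussian distributions (the standard Bures formula). A minimal sketch:

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_gaussian(m1, S1, m2, S2):
    """2-Wasserstein distance between N(m1, S1) and N(m2, S2):
    W2^2 = |m1 - m2|^2 + tr(S1 + S2 - 2 * (S1^{1/2} S2 S1^{1/2})^{1/2})."""
    root1 = sqrtm(S1)
    cross = np.real(sqrtm(root1 @ S2 @ root1))  # sqrtm may carry tiny imaginary parts
    bures = np.trace(S1 + S2 - 2.0 * cross)
    return np.sqrt(np.abs(np.sum((m1 - m2) ** 2) + bures))

m1, S1 = np.zeros(2), np.eye(2)
m2, S2 = np.array([3.0, 0.0]), 4.0 * np.eye(2)
d = w2_gaussian(m1, S1, m2, S2)  # = sqrt(9 + tr(5I - 4I)) = sqrt(11)
```

Because the optimal transport map between Gaussians is linear, regression models in this geometry reduce to operations on mean vectors and covariance matrices, which is what makes the linear-matrix-space construction in the abstract workable.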
Pub Date: 2024-05-17 | DOI: 10.1016/j.jmva.2024.105333
Majid Noroozi , Marianna Pensky
The paper considers the DIverse MultiPLEx (DIMPLE) network model, where all layers of the network have the same collection of nodes and are equipped with Stochastic Block Models. In addition, all layers can be partitioned into groups with the same community structures, although the layers in the same group may have different matrices of block connection probabilities. To the best of our knowledge, the DIMPLE model, introduced in Pensky and Wang (2021), is the broadest SBM-equipped binary multilayer network model on the same set of nodes and thus generalizes a multitude of papers that study more restrictive settings. Under the DIMPLE model, the main task is to identify the groups of layers with the same community structures, since the matrices of block connection probabilities act as nuisance parameters under the DIMPLE paradigm. The main contribution of the paper is achieving strongly consistent between-layer clustering by using Sparse Subspace Clustering (SSC), a well-developed technique from computer vision. In addition, SSC can handle much larger networks than spectral clustering and is well suited to parallel computing.
{"title":"Sparse subspace clustering in diverse multiplex network model","authors":"Majid Noroozi , Marianna Pensky","doi":"10.1016/j.jmva.2024.105333","DOIUrl":"https://doi.org/10.1016/j.jmva.2024.105333","url":null,"abstract":"<div><p>The paper considers the DIverse MultiPLEx (DIMPLE) network model, where all layers of the network have the same collection of nodes and are equipped with the Stochastic Block Models. In addition, all layers can be partitioned into groups with the same community structures, although the layers in the same group may have different matrices of block connection probabilities. To the best of our knowledge, the DIMPLE model, introduced in Pensky and Wang (2021), presents the most broad SBM-equipped binary multilayer network model on the same set of nodes and, thus, generalizes a multitude of papers that study more restrictive settings. Under the DIMPLE model, the main task is to identify the groups of layers with the same community structures since the matrices of block connection probabilities act as nuisance parameters under the DIMPLE paradigm. The main contribution of the paper is achieving the strongly consistent between-layer clustering by using Sparse Subspace Clustering (SSC), the well-developed technique in computer vision. In addition, SSC allows to handle much larger networks than spectral clustering, and is perfectly suitable for application of parallel computing. 
Moreover, our paper is the first one to obtain precision guarantees for SSC when it is applied to binary data.</p></div>","PeriodicalId":16431,"journal":{"name":"Journal of Multivariate Analysis","volume":"203 ","pages":"Article 105333"},"PeriodicalIF":1.6,"publicationDate":"2024-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141095842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
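A compact sketch of generic SSC as used in computer vision: express each sample as a sparse (Lasso) combination of the other samples, build the affinity |C| + |C|', and spectrally cluster it. The Lasso penalty alpha and the toy data are illustrative choices; this is not the paper's DIMPLE-specific procedure or its binary-data analysis.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.cluster import SpectralClustering

def ssc(X, n_clusters, alpha=0.01):
    """Sparse subspace clustering: sparse self-representation of each
    row of X by the remaining rows, then spectral clustering."""
    n = X.shape[0]
    C = np.zeros((n, n))
    for i in range(n):
        others = np.delete(np.arange(n), i)
        fit = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
        fit.fit(X[others].T, X[i])  # coordinates are the "observations"
        C[i, others] = fit.coef_
    A = np.abs(C) + np.abs(C).T  # symmetric affinity
    return SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                              random_state=0).fit_predict(A)

rng = np.random.default_rng(3)
# two 1-dimensional subspaces of R^4, spanned by orthogonal directions
u = np.array([1.0, 0.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0, 0.0])
X = np.vstack([np.outer(rng.uniform(1.0, 2.0, 30), u),
               np.outer(rng.uniform(1.0, 2.0, 30), v)])
labels = ssc(X, n_clusters=2)
```

Points from one subspace cannot reduce the residual of a point in an orthogonal subspace, so the learned coefficient matrix is block-structured and the affinity graph splits into the two subspace clusters.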
Pub Date: 2024-05-17 | DOI: 10.1016/j.jmva.2024.105331
Christian Genest , Johanna G. Nešlehová
Recently, Mai and Wang (2021) investigated a class of ℓp-norm symmetric survival functions on the positive orthant. In their paper, they claim that the generator of these functions must be d-monotone. This note explains that this is not true in general. Luckily, most of the results in Mai and Wang (2021) are not affected by this oversight.
{"title":"On the Mai–Wang stochastic decomposition for ℓp-norm symmetric survival functions on the positive orthant","authors":"Christian Genest , Johanna G. Nešlehová","doi":"10.1016/j.jmva.2024.105331","DOIUrl":"10.1016/j.jmva.2024.105331","url":null,"abstract":"<div><p>Recently, Mai and Wang (2021) investigated a class of <span><math><msub><mrow><mi>ℓ</mi></mrow><mrow><mi>p</mi></mrow></msub></math></span>-norm symmetric survival functions on the positive orthant. In their paper, they claim that the generator of these functions must be <span><math><mi>d</mi></math></span>-monotone. This note explains that this is not true in general. Luckily, most of the results in Mai and Wang (2021) are not affected by this oversight.</p></div>","PeriodicalId":16431,"journal":{"name":"Journal of Multivariate Analysis","volume":"203 ","pages":"Article 105331"},"PeriodicalIF":1.6,"publicationDate":"2024-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0047259X24000381/pdfft?md5=f0a3613b1587ac23eed097d6f63a0a06&pid=1-s2.0-S0047259X24000381-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141028268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-05-15 | DOI: 10.1016/j.jmva.2024.105330
Wei Dong , Chen Xu , Jinhan Xie , Niansheng Tang
Model-based clustering is a commonly-used technique to partition heterogeneous data into homogeneous groups. When the analysis is to be conducted with a large number of features, analysts face simultaneous challenges in model interpretability, clustering accuracy, and computational efficiency. Several Bayesian and penalization methods have been proposed to select important features for model-based clustering. However, the performance of those methods relies on careful algorithmic tuning, which can be time-consuming for high-dimensional cases. In this paper, we propose a new sparse clustering method based on alternating hard-thresholding. The new method is conceptually simple and tuning-free. With a user-specified sparsity level, it efficiently detects a set of key features by eliminating a large number of features that are less useful for clustering. Based on the selected key features, one can readily obtain an effective clustering of the original high-dimensional data under a general sparse covariance structure. Under mild conditions, we show that the new method leads to clusters with a misclassification rate consistent with the optimal rate, as if the underlying true model were used.
{"title":"Tuning-free sparse clustering via alternating hard-thresholding","authors":"Wei Dong , Chen Xu , Jinhan Xie , Niansheng Tang","doi":"10.1016/j.jmva.2024.105330","DOIUrl":"10.1016/j.jmva.2024.105330","url":null,"abstract":"<div><p>Model-based clustering is a commonly-used technique to partition heterogeneous data into homogeneous groups. When the analysis is to be conducted with a large number of features, analysts face simultaneous challenges in model interpretability, clustering accuracy, and computational efficiency. Several Bayesian and penalization methods have been proposed to select important features for model-based clustering. However, the performance of those methods relies on a careful algorithmic tuning, which can be time-consuming for high-dimensional cases. In this paper, we propose a new sparse clustering method based on alternating hard-thresholding. The new method is conceptually simple and tuning-free. With a user-specified sparsity level, it efficiently detects a set of key features by eliminating a large number of features that are less useful for clustering. Based on the selected key features, one can readily obtain an effective clustering of the original high-dimensional data under a general sparse covariance structure. Under mild conditions, we show that the new method leads to clusters with a misclassification rate consistent to the optimal rate as if the underlying true model were used. 
The promising performance of the new method is supported by both simulated and real data examples.</p></div>","PeriodicalId":16431,"journal":{"name":"Journal of Multivariate Analysis","volume":"203 ","pages":"Article 105330"},"PeriodicalIF":1.6,"publicationDate":"2024-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141050885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
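The paper's exact algorithm is not reproduced here, but the flavor of alternating hard-thresholding for sparse clustering can be conveyed in a k-means-style toy version: alternate a cluster-assignment step on the currently active features with a hard-thresholding step that keeps the s features with the largest between-cluster variance. All names, the initialization, and the toy data below are illustrative.

```python
import numpy as np

def sparse_kmeans_ht(X, k, s, n_iter=20):
    """Toy alternating hard-thresholding for sparse clustering:
    k-means-style assignments on the active features, then re-select
    the top-s features by between-cluster variance."""
    n, p = X.shape
    active = np.arange(p)  # start with all features active
    # crude deterministic start: split on the first coordinate (toy choice)
    labels = (X[:, 0] > np.median(X[:, 0])).astype(int)
    for _ in range(n_iter):
        centers = np.vstack([X[labels == c].mean(axis=0) for c in range(k)])
        # assignment step restricted to the active features
        Xa, Ca = X[:, active], centers[:, active]
        labels = ((Xa[:, None, :] - Ca[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        # hard-thresholding step: keep the s most separating features
        centers = np.vstack([X[labels == c].mean(axis=0) for c in range(k)])
        score = ((centers - X.mean(axis=0)) ** 2).sum(axis=0)
        active = np.sort(np.argsort(score)[-s:])
    return labels, active

rng = np.random.default_rng(4)
# two clusters separated only in the first 2 of 20 features
X = rng.normal(size=(200, 20))
X[:100, :2] += 4.0
labels, active = sparse_kmeans_ht(X, k=2, s=2)  # active should be features 0 and 1
```

The only user input is the sparsity level s, which matches the tuning-free spirit described in the abstract; the paper's actual method and guarantees are, of course, more refined than this sketch.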
Mixed data comprise measurements of different types, with both categorical and continuous variables, and can be found in various areas, such as in life science or industrial processes. Inferring conditional independencies from the data is crucial to understand how these variables relate to each other. To this end, graphical models provide an effective framework, which adopts a graph-based representation of the joint distribution to encode such dependence relations. This framework has been extensively studied in the Gaussian and categorical settings separately; on the other hand, the literature addressing this problem in presence of mixed data is still narrow. We propose a Bayesian model for the analysis of mixed data based on the notion of Conditional Gaussian (CG) distribution. Our method is based on a canonical parameterization of the CG distribution, which allows for posterior inference of parameters indexing the (marginal) distributions of continuous and categorical variables, as well as expressing the interactions between the two types of variables. We derive the limiting Gaussian distributions, centered on the correct unknown value and with vanishing variance, for the Bayesian estimators of the canonical parameters expressing continuous, discrete and mixed interactions. In addition, we implement the proposed method for structure learning purposes, namely to infer the underlying graph of conditional independencies. When compared to alternative frequentist methods, our approach shows favorable results both in a simulation setting and in real-data applications, besides allowing for a coherent uncertainty quantification around parameter estimates.
{"title":"Bayesian inference of graph-based dependencies from mixed-type data","authors":"Chiara Galimberti , Stefano Peluso , Federico Castelletti","doi":"10.1016/j.jmva.2024.105323","DOIUrl":"https://doi.org/10.1016/j.jmva.2024.105323","url":null,"abstract":"<div><p>Mixed data comprise measurements of different types, with both categorical and continuous variables, and can be found in various areas, such as in life science or industrial processes. Inferring conditional independencies from the data is crucial to understand how these variables relate to each other. To this end, graphical models provide an effective framework, which adopts a graph-based representation of the joint distribution to encode such dependence relations. This framework has been extensively studied in the Gaussian and categorical settings separately; on the other hand, the literature addressing this problem in presence of mixed data is still narrow. We propose a Bayesian model for the analysis of mixed data based on the notion of Conditional Gaussian (CG) distribution. Our method is based on a canonical parameterization of the CG distribution, which allows for posterior inference of parameters indexing the (marginal) distributions of continuous and categorical variables, as well as expressing the interactions between the two types of variables. We derive the limiting Gaussian distributions, centered on the correct unknown value and with vanishing variance, for the Bayesian estimators of the canonical parameters expressing continuous, discrete and mixed interactions. In addition, we implement the proposed method for structure learning purposes, namely to infer the underlying graph of conditional independencies. 
When compared to alternative frequentist methods, our approach shows favorable results both in a simulation setting and in real-data applications, besides allowing for a coherent uncertainty quantification around parameter estimates.</p></div>","PeriodicalId":16431,"journal":{"name":"Journal of Multivariate Analysis","volume":"203 ","pages":"Article 105323"},"PeriodicalIF":1.6,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140906825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
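The canonical parameterization of a CG distribution that the method builds on follows Lauritzen's classical identities: for each cell δ of the categorical variables, f(δ, y) ∝ exp(g(δ) + h(δ)'y − ½ y'Ky), with K = Σ⁻¹, h(δ) = Kμ(δ) and g(δ) = log p(δ) − ½ μ(δ)'Kμ(δ) − ½ log det(2πΣ). A sketch of the moment-to-canonical map and its inverse (the specific cell probabilities, means and covariance below are illustrative):

```python
import numpy as np

def moment_to_canonical(p_delta, mu, Sigma):
    """Moment -> canonical CG parameters; mu has one row per cell delta."""
    K = np.linalg.inv(Sigma)
    h = mu @ K  # h(delta) = K mu(delta), K symmetric
    g = (np.log(p_delta)
         - 0.5 * np.einsum('ij,ij->i', mu @ K, mu)
         - 0.5 * np.linalg.slogdet(2.0 * np.pi * Sigma)[1])
    return g, h, K

def canonical_to_moment(g, h, K):
    """Inverse map: canonical -> moment CG parameters."""
    Sigma = np.linalg.inv(K)
    mu = h @ Sigma
    logp = (g + 0.5 * np.einsum('ij,ij->i', h @ Sigma, h)
              + 0.5 * np.linalg.slogdet(2.0 * np.pi * Sigma)[1])
    p = np.exp(logp)
    return p / p.sum(), mu, Sigma

# illustrative CG: one binary variable (2 cells), two continuous variables
p0 = np.array([0.3, 0.7])
mu0 = np.array([[0.0, 0.0], [1.0, 2.0]])
S0 = np.array([[1.0, 0.3], [0.3, 1.0]])
g, h, K = moment_to_canonical(p0, mu0, S0)
p1, mu1, S1 = canonical_to_moment(g, h, K)  # round trip recovers (p0, mu0, S0)
```

Working in the canonical scale is what lets the interaction terms between discrete and continuous variables be read off directly, which is the feature the abstract exploits for posterior inference.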