{"title":"基于能量距离或最大均值差异统计的多变量双样本问题计算效率高的置换检验","authors":"Elias Chaibub Neto","doi":"arxiv-2406.06488","DOIUrl":null,"url":null,"abstract":"Non-parametric two-sample tests based on energy distance or maximum mean\ndiscrepancy are widely used statistical tests for comparing multivariate data\nfrom two populations. While these tests enjoy desirable statistical properties,\ntheir test statistics can be expensive to compute as they require the\ncomputation of 3 distinct Euclidean distance (or kernel) matrices between\nsamples, where the time complexity of each of these computations (namely,\n$O(n_{x}^2 p)$, $O(n_{y}^2 p)$, and $O(n_{x} n_{y} p)$) scales quadratically\nwith the number of samples ($n_x$, $n_y$) and linearly with the number of\nvariables ($p$). Since the standard permutation test requires repeated\nre-computations of these expensive statistics it's application to large\ndatasets can become unfeasible. While several statistical approaches have been\nproposed to mitigate this issue, they all sacrifice desirable statistical\nproperties to decrease the computational cost (e.g., trade computation speed by\na decrease in statistical power). A better computational strategy is to first\npre-compute the Euclidean distance (kernel) matrix of the concatenated data,\nand then permute indexes and retrieve the corresponding elements to compute the\nre-sampled statistics. While this strategy can reduce the computation cost\nrelative to the standard permutation test, it relies on the computation of a\nlarger Euclidean distance (kernel) matrix with complexity $O((n_x + n_y)^2 p)$.\nIn this paper, we present a novel computationally efficient permutation\nalgorithm which only requires the pre-computation of the 3 smaller matrices and\nachieves large computational speedups without sacrificing finite-sample\nvalidity or statistical power. We illustrate its computational gains in a\nseries of experiments and compare its statistical power to the current\nstate-of-the-art approach for balancing computational cost and statistical\nperformance.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Computationally efficient permutation tests for the multivariate two-sample problem based on energy distance or maximum mean discrepancy statistics\",\"authors\":\"Elias Chaibub Neto\",\"doi\":\"arxiv-2406.06488\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Non-parametric two-sample tests based on energy distance or maximum mean\\ndiscrepancy are widely used statistical tests for comparing multivariate data\\nfrom two populations. While these tests enjoy desirable statistical properties,\\ntheir test statistics can be expensive to compute as they require the\\ncomputation of 3 distinct Euclidean distance (or kernel) matrices between\\nsamples, where the time complexity of each of these computations (namely,\\n$O(n_{x}^2 p)$, $O(n_{y}^2 p)$, and $O(n_{x} n_{y} p)$) scales quadratically\\nwith the number of samples ($n_x$, $n_y$) and linearly with the number of\\nvariables ($p$). Since the standard permutation test requires repeated\\nre-computations of these expensive statistics it's application to large\\ndatasets can become unfeasible. 
While several statistical approaches have been\\nproposed to mitigate this issue, they all sacrifice desirable statistical\\nproperties to decrease the computational cost (e.g., trade computation speed by\\na decrease in statistical power). A better computational strategy is to first\\npre-compute the Euclidean distance (kernel) matrix of the concatenated data,\\nand then permute indexes and retrieve the corresponding elements to compute the\\nre-sampled statistics. While this strategy can reduce the computation cost\\nrelative to the standard permutation test, it relies on the computation of a\\nlarger Euclidean distance (kernel) matrix with complexity $O((n_x + n_y)^2 p)$.\\nIn this paper, we present a novel computationally efficient permutation\\nalgorithm which only requires the pre-computation of the 3 smaller matrices and\\nachieves large computational speedups without sacrificing finite-sample\\nvalidity or statistical power. We illustrate its computational gains in a\\nseries of experiments and compare its statistical power to the current\\nstate-of-the-art approach for balancing computational cost and statistical\\nperformance.\",\"PeriodicalId\":501215,\"journal\":{\"name\":\"arXiv - STAT - Computation\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-06-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - STAT - Computation\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2406.06488\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - STAT - Computation","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2406.06488","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Computationally efficient permutation tests for the multivariate two-sample problem based on energy distance or maximum mean discrepancy statistics
Non-parametric two-sample tests based on energy distance or maximum mean
discrepancy are widely used statistical tests for comparing multivariate data
from two populations. While these tests enjoy desirable statistical properties,
their test statistics can be expensive to compute, as they require the
computation of three distinct Euclidean distance (or kernel) matrices between
samples, where the time complexity of each of these computations (namely,
$O(n_{x}^2 p)$, $O(n_{y}^2 p)$, and $O(n_{x} n_{y} p)$) scales quadratically
with the number of samples ($n_x$, $n_y$) and linearly with the number of
variables ($p$). Since the standard permutation test requires repeated
re-computation of these expensive statistics, its application to large
datasets can become infeasible.
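To make the cost concrete, below is a minimal sketch of the statistic computation (assuming Python with NumPy/SciPy and samples stored as arrays of shape $(n_x, p)$ and $(n_y, p)$; it illustrates the standard energy-distance estimator rather than the paper's implementation, and the $n_x n_y / (n_x + n_y)$ scaling is one common convention):

```python
import numpy as np
from scipy.spatial.distance import cdist


def energy_statistic(X, Y):
    """Energy-distance two-sample statistic built from the three pairwise
    Euclidean distance matrices of cost O(nx^2 p), O(ny^2 p), O(nx ny p)."""
    nx, ny = len(X), len(Y)
    d_xy = cdist(X, Y)  # nx x ny between-sample distances
    d_xx = cdist(X, X)  # nx x nx within-sample distances
    d_yy = cdist(Y, Y)  # ny x ny within-sample distances
    e = 2.0 * d_xy.mean() - d_xx.mean() - d_yy.mean()
    return nx * ny / (nx + ny) * e  # common scaling of the raw statistic
```

A standard permutation test calls a function like this once per permutation, recomputing all three distance matrices every time, which is the repeated cost described above.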
While several statistical approaches have been proposed to mitigate this
issue, they all sacrifice desirable statistical properties to decrease the
computational cost (e.g., they trade a loss of statistical power for
computation speed). A better computational strategy is to first pre-compute
the Euclidean distance (kernel) matrix of the concatenated data, and then
permute indices and retrieve the corresponding elements to compute the
re-sampled statistics. While this strategy can reduce the computational cost
relative to the standard permutation test, it relies on the computation of a
larger Euclidean distance (kernel) matrix with complexity $O((n_x + n_y)^2 p)$.
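The index-permutation strategy just described can be sketched as follows (again a hedged illustration under the same assumptions as above, not the paper's proposed algorithm): the distance matrix of the concatenated data is computed once, and each permutation only slices and averages its entries.

```python
import numpy as np
from scipy.spatial.distance import cdist


def permutation_pvalue_precomputed(X, Y, n_perm=999, seed=None):
    """Permutation test that pre-computes the (nx+ny) x (nx+ny) Euclidean
    distance matrix of the concatenated data once and re-uses it by
    permuting indices into that matrix."""
    rng = np.random.default_rng(seed)
    nx, ny = len(X), len(Y)
    Z = np.vstack([X, Y])
    D = cdist(Z, Z)  # O((nx + ny)^2 p), computed only once

    def stat(idx):
        ix, iy = idx[:nx], idx[nx:]
        e = (2.0 * D[np.ix_(ix, iy)].mean()
             - D[np.ix_(ix, ix)].mean()
             - D[np.ix_(iy, iy)].mean())
        return nx * ny / (nx + ny) * e

    observed = stat(np.arange(nx + ny))
    perm_stats = [stat(rng.permutation(nx + ny)) for _ in range(n_perm)]
    # permutation p-value that includes the observed statistic
    return (1 + sum(s >= observed for s in perm_stats)) / (n_perm + 1)
```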
In this paper, we present a novel computationally efficient permutation
algorithm that requires only the pre-computation of the three smaller matrices and
achieves large computational speedups without sacrificing finite-sample
validity or statistical power. We illustrate its computational gains in a
series of experiments and compare its statistical power to the current
state-of-the-art approach for balancing computational cost and statistical
performance.