{"title":"基于能量距离或最大均值差异统计的多变量双样本问题计算效率高的置换检验","authors":"Elias Chaibub Neto","doi":"arxiv-2406.06488","DOIUrl":null,"url":null,"abstract":"Non-parametric two-sample tests based on energy distance or maximum mean\ndiscrepancy are widely used statistical tests for comparing multivariate data\nfrom two populations. While these tests enjoy desirable statistical properties,\ntheir test statistics can be expensive to compute as they require the\ncomputation of 3 distinct Euclidean distance (or kernel) matrices between\nsamples, where the time complexity of each of these computations (namely,\n$O(n_{x}^2 p)$, $O(n_{y}^2 p)$, and $O(n_{x} n_{y} p)$) scales quadratically\nwith the number of samples ($n_x$, $n_y$) and linearly with the number of\nvariables ($p$). Since the standard permutation test requires repeated\nre-computations of these expensive statistics it's application to large\ndatasets can become unfeasible. While several statistical approaches have been\nproposed to mitigate this issue, they all sacrifice desirable statistical\nproperties to decrease the computational cost (e.g., trade computation speed by\na decrease in statistical power). A better computational strategy is to first\npre-compute the Euclidean distance (kernel) matrix of the concatenated data,\nand then permute indexes and retrieve the corresponding elements to compute the\nre-sampled statistics. While this strategy can reduce the computation cost\nrelative to the standard permutation test, it relies on the computation of a\nlarger Euclidean distance (kernel) matrix with complexity $O((n_x + n_y)^2 p)$.\nIn this paper, we present a novel computationally efficient permutation\nalgorithm which only requires the pre-computation of the 3 smaller matrices and\nachieves large computational speedups without sacrificing finite-sample\nvalidity or statistical power. We illustrate its computational gains in a\nseries of experiments and compare its statistical power to the current\nstate-of-the-art approach for balancing computational cost and statistical\nperformance.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Computationally efficient permutation tests for the multivariate two-sample problem based on energy distance or maximum mean discrepancy statistics\",\"authors\":\"Elias Chaibub Neto\",\"doi\":\"arxiv-2406.06488\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Non-parametric two-sample tests based on energy distance or maximum mean\\ndiscrepancy are widely used statistical tests for comparing multivariate data\\nfrom two populations. While these tests enjoy desirable statistical properties,\\ntheir test statistics can be expensive to compute as they require the\\ncomputation of 3 distinct Euclidean distance (or kernel) matrices between\\nsamples, where the time complexity of each of these computations (namely,\\n$O(n_{x}^2 p)$, $O(n_{y}^2 p)$, and $O(n_{x} n_{y} p)$) scales quadratically\\nwith the number of samples ($n_x$, $n_y$) and linearly with the number of\\nvariables ($p$). Since the standard permutation test requires repeated\\nre-computations of these expensive statistics it's application to large\\ndatasets can become unfeasible. 
While several statistical approaches have been\\nproposed to mitigate this issue, they all sacrifice desirable statistical\\nproperties to decrease the computational cost (e.g., trade computation speed by\\na decrease in statistical power). A better computational strategy is to first\\npre-compute the Euclidean distance (kernel) matrix of the concatenated data,\\nand then permute indexes and retrieve the corresponding elements to compute the\\nre-sampled statistics. While this strategy can reduce the computation cost\\nrelative to the standard permutation test, it relies on the computation of a\\nlarger Euclidean distance (kernel) matrix with complexity $O((n_x + n_y)^2 p)$.\\nIn this paper, we present a novel computationally efficient permutation\\nalgorithm which only requires the pre-computation of the 3 smaller matrices and\\nachieves large computational speedups without sacrificing finite-sample\\nvalidity or statistical power. We illustrate its computational gains in a\\nseries of experiments and compare its statistical power to the current\\nstate-of-the-art approach for balancing computational cost and statistical\\nperformance.\",\"PeriodicalId\":501215,\"journal\":{\"name\":\"arXiv - STAT - Computation\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-06-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - STAT - Computation\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2406.06488\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - STAT - Computation","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2406.06488","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Computationally efficient permutation tests for the multivariate two-sample problem based on energy distance or maximum mean discrepancy statistics
Non-parametric two-sample tests based on energy distance or maximum mean
discrepancy are widely used statistical tests for comparing multivariate data
from two populations. While these tests enjoy desirable statistical properties,
their test statistics can be expensive to compute, as they require the
computation of three distinct Euclidean distance (or kernel) matrices between
samples, where the time complexity of each of these computations (namely,
$O(n_{x}^2 p)$, $O(n_{y}^2 p)$, and $O(n_{x} n_{y} p)$) scales quadratically
with the number of samples ($n_x$, $n_y$) and linearly with the number of
variables ($p$). Since the standard permutation test requires repeated
re-computation of these expensive statistics, its application to large
datasets can become infeasible.
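To make the cost concrete, below is a minimal sketch of the statistic computation (assuming Python with NumPy/SciPy and samples stored as arrays of shape $(n_x, p)$ and $(n_y, p)$; it illustrates the standard energy-distance estimator rather than the paper's implementation, and the $n_x n_y / (n_x + n_y)$ scaling is one common convention):

```python
import numpy as np
from scipy.spatial.distance import cdist


def energy_statistic(X, Y):
    """Energy-distance two-sample statistic built from the three pairwise
    Euclidean distance matrices of cost O(nx^2 p), O(ny^2 p), O(nx ny p)."""
    nx, ny = len(X), len(Y)
    d_xy = cdist(X, Y)  # nx x ny between-sample distances
    d_xx = cdist(X, X)  # nx x nx within-sample distances
    d_yy = cdist(Y, Y)  # ny x ny within-sample distances
    e = 2.0 * d_xy.mean() - d_xx.mean() - d_yy.mean()
    return nx * ny / (nx + ny) * e  # common scaling of the raw statistic
```

A standard permutation test calls a function like this once per permutation, recomputing all three distance matrices every time, which is the repeated cost described above.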
While several statistical approaches have been proposed to mitigate this
issue, they all sacrifice desirable statistical properties to decrease the
computational cost (e.g., they trade a loss of statistical power for
computation speed). A better computational strategy is to first pre-compute
the Euclidean distance (kernel) matrix of the concatenated data, and then
permute indices and retrieve the corresponding elements to compute the
re-sampled statistics. While this strategy can reduce the computational cost
relative to the standard permutation test, it relies on the computation of a
larger Euclidean distance (kernel) matrix with complexity $O((n_x + n_y)^2 p)$.
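The index-permutation strategy just described can be sketched as follows (again a hedged illustration under the same assumptions as above, not the paper's proposed algorithm): the distance matrix of the concatenated data is computed once, and each permutation only slices and averages its entries.

```python
import numpy as np
from scipy.spatial.distance import cdist


def permutation_pvalue_precomputed(X, Y, n_perm=999, seed=None):
    """Permutation test that pre-computes the (nx+ny) x (nx+ny) Euclidean
    distance matrix of the concatenated data once and re-uses it by
    permuting indices into that matrix."""
    rng = np.random.default_rng(seed)
    nx, ny = len(X), len(Y)
    Z = np.vstack([X, Y])
    D = cdist(Z, Z)  # O((nx + ny)^2 p), computed only once

    def stat(idx):
        ix, iy = idx[:nx], idx[nx:]
        e = (2.0 * D[np.ix_(ix, iy)].mean()
             - D[np.ix_(ix, ix)].mean()
             - D[np.ix_(iy, iy)].mean())
        return nx * ny / (nx + ny) * e

    observed = stat(np.arange(nx + ny))
    perm_stats = [stat(rng.permutation(nx + ny)) for _ in range(n_perm)]
    # permutation p-value that includes the observed statistic
    return (1 + sum(s >= observed for s in perm_stats)) / (n_perm + 1)
```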
In this paper, we present a novel computationally efficient permutation
algorithm that requires only the pre-computation of the three smaller matrices and
achieves large computational speedups without sacrificing finite-sample
validity or statistical power. We illustrate its computational gains in a
series of experiments and compare its statistical power to the current
state-of-the-art approach for balancing computational cost and statistical
performance.