Testing Dependency of Unlabeled Databases

IEEE Transactions on Information Theory (IF 2.2; Zone 3, Computer Science; JCR Q3, Computer Science, Information Systems). Pub Date: 2024-08-13. DOI: 10.1109/TIT.2024.3442977

Vered Paslev;Wasim Huleihel

IEEE Transactions on Information Theory, vol. 70, no. 10, pp. 7410-7431, 2024. Not open access. Full text: https://ieeexplore.ieee.org/document/10634574/ (metadata via Semantic Scholar).

Citations: 0

Abstract



Testing Dependency of Unlabeled Databases

In this paper, we investigate the problem of deciding whether two random databases $\textsf{X}\in\mathcal{X}^{n\times d}$ and $\textsf{Y}\in\mathcal{Y}^{n\times d}$ are statistically dependent or not. This is formulated as a hypothesis testing problem, where under the null hypothesis, these two databases are statistically independent, while under the alternative, there exists an unknown row permutation $\sigma$, such that $\textsf{X}$ and $\textsf{Y}^{\sigma}$, a permuted version of $\textsf{Y}$, are statistically dependent with some known joint distribution, but have the same marginal distributions as the null. We characterize the thresholds at which optimal testing is information-theoretically impossible and possible, as a function of $n$, $d$, and some spectral properties of the generative distributions of the datasets. For example, we prove that if a certain function of the eigenvalues of the likelihood function and $d$, is below a certain threshold, as $d\to\infty$, then weak detection (performing slightly better than random guessing) is statistically impossible, no matter what the value of $n$ is. This mimics the performance of an efficient test that thresholds a centered version of the log-likelihood function of the observed matrices. We also analyze the case where $d$ is fixed, for which we derive strong (vanishing error) and weak detection lower and upper bounds.
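To make the testing setup concrete, here is a minimal sketch under assumptions that are not taken from the paper: entries are i.i.d. standard Gaussian, dependent pairs have a known entrywise correlation `rho`, and the statistic sums the per-pair Gaussian log-likelihood ratio over all row pairs (so it does not depend on the hidden row permutation) and subtracts its mean under the null. The names `sample_databases`, `centered_llr_statistic`, and `detect`, along with the halfway threshold, are hypothetical choices for illustration and may differ from the authors' test.

```python
import numpy as np


def sample_databases(n, d, rho, dependent, rng):
    """Draw X, Y of shape (n, d) with standard normal entries (assumed model)."""
    X = rng.standard_normal((n, d))
    if dependent:
        # Alternative: correlate Y with X entrywise, then hide the row
        # pairing behind a uniformly random row permutation.
        Z = rng.standard_normal((n, d))
        Y = rho * X + np.sqrt(1.0 - rho**2) * Z
        Y = Y[rng.permutation(n)]
    else:
        # Null: Y is independent of X with the same marginal distribution.
        Y = rng.standard_normal((n, d))
    return X, Y


def centered_llr_statistic(X, Y, rho):
    """Sum the per-entry Gaussian log-likelihood ratio over all row pairs
    (i, j) and all columns, then subtract its expectation under the null."""
    n, d = X.shape
    x = X[:, None, :]  # shape (n, 1, d)
    y = Y[None, :, :]  # shape (1, n, d)
    # log of (bivariate normal density with correlation rho)
    #        / (product of standard normal densities)
    llr = (-0.5 * np.log(1.0 - rho**2)
           - (rho**2 * (x**2 + y**2) - 2.0 * rho * x * y)
           / (2.0 * (1.0 - rho**2)))
    null_mean_per_entry = -0.5 * np.log(1.0 - rho**2) - rho**2 / (1.0 - rho**2)
    return llr.sum() - n * n * d * null_mean_per_entry


def detect(X, Y, rho):
    """Declare 'dependent' when the centered statistic exceeds half of its
    mean under the alternative (an illustrative threshold choice)."""
    n, d = X.shape
    threshold = 0.5 * n * d * rho**2 / (1.0 - rho**2)
    return centered_llr_statistic(X, Y, rho) > threshold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for dependent in (False, True):
        X, Y = sample_databases(n=10, d=4000, rho=0.9, dependent=dependent, rng=rng)
        print(dependent, detect(X, Y, rho=0.9))
```

Summing over all row pairs is what makes the statistic usable without knowing $\sigma$: reordering the rows of `Y` only reorders the terms of the sum. In this toy model the statistic separates the hypotheses only when `d` is large relative to `n` and `rho` is not too small, which loosely echoes the abstract's point that the dimension and the dependence strength, rather than `n` alone, govern detectability.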


Source journal: IEEE Transactions on Information Theory (Engineering and Technology; Engineering, Electrical and Electronic)

CiteScore: 5.70

Self-citation rate: 20.00%

Articles published: 514

Review time: 12 months

Journal description: The IEEE Transactions on Information Theory is a journal that publishes theoretical and experimental papers concerned with the transmission, processing, and utilization of information. The boundaries of acceptable subject matter are intentionally not sharply delimited. Rather, it is hoped that as the focus of research activity changes, a flexible policy will permit this Transactions to follow suit. Current appropriate topics are best reflected by recent Tables of Contents; they are summarized in the titles of editorial areas that appear on the inside front cover.
