Testing Dependency of Unlabeled Databases
Vered Paslev; Wasim Huleihel
IEEE Transactions on Information Theory, vol. 70, no. 10, pp. 7410-7431. Published 2024-08-13.
DOI: 10.1109/TIT.2024.3442977
https://ieeexplore.ieee.org/document/10634574/
In this paper, we investigate the problem of deciding whether two random databases $\textsf{X}\in\mathcal{X}^{n\times d}$ and $\textsf{Y}\in\mathcal{Y}^{n\times d}$ are statistically dependent or not. This is formulated as a hypothesis testing problem, where under the null hypothesis, these two databases are statistically independent, while under the alternative, there exists an unknown row permutation $\sigma$, such that $\textsf{X}$ and $\textsf{Y}^{\sigma}$, a permuted version of $\textsf{Y}$, are statistically dependent with some known joint distribution, but have the same marginal distributions as the null. We characterize the thresholds at which optimal testing is information-theoretically impossible and possible, as a function of $n$, $d$, and some spectral properties of the generative distributions of the datasets. For example, we prove that if a certain function of $d$ and the eigenvalues of the likelihood function is below a certain threshold as $d\to\infty$, then weak detection (performing slightly better than random guessing) is statistically impossible, no matter what the value of $n$ is. This mimics the performance of an efficient test that thresholds a centered version of the log-likelihood function of the observed matrices. We also analyze the case where $d$ is fixed, for which we derive strong (vanishing error) and weak detection lower and upper bounds.
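To make the idea of thresholding a centered log-likelihood concrete, the following is a minimal sketch under assumptions that the abstract does not state: entries are standard Gaussian, and under the alternative each matched pair of entries has correlation `rho`. The statistic sums the per-entry log-likelihood ratio over all row pairs and subtracts its mean under the null, so it does not need to know $\sigma$. The function names, the parameter `rho`, and the threshold choice are hypothetical and illustrative only; this is not claimed to be the exact test analyzed in the paper.

```python
import numpy as np


def sample_databases(n, d, rho, dependent, rng):
    """Draw two n x d databases with standard normal entries.

    If `dependent` is True, each row of Y is rho-correlated with a row of X,
    and the rows of Y are then shuffled by a hidden random permutation.
    (Gaussian model assumed for illustration only.)
    """
    X = rng.standard_normal((n, d))
    Z = rng.standard_normal((n, d))
    if not dependent:
        return X, Z
    Y = rho * X + np.sqrt(1.0 - rho**2) * Z
    return X, Y[rng.permutation(n)]


def centered_llr_statistic(X, Y, rho):
    """Sum over all row pairs (i, j) and columns k of the Gaussian
    log-likelihood ratio of (X[i, k], Y[j, k]), minus its null mean.

    The sum over pairs collapses to column sums, so the cost is O(n * d).
    """
    n, d = X.shape
    cross = np.dot(X.sum(axis=0), Y.sum(axis=0))      # sum_{i,j,k} X_ik * Y_jk
    squares = n * (np.sum(X**2) + np.sum(Y**2))       # sum_{i,j,k} (X_ik^2 + Y_jk^2)
    raw = (2.0 * rho * cross - rho**2 * squares) / (2.0 * (1.0 - rho**2))
    # Centering term: minus the null expectation of `raw`.
    return raw + n * n * d * rho**2 / (1.0 - rho**2)


def decide_dependent(X, Y, rho):
    """Declare dependence when the statistic exceeds half of its mean under
    the alternative (an illustrative threshold, not a tuned one)."""
    n, d = X.shape
    threshold = 0.5 * n * d * rho**2 / (1.0 - rho**2)
    return centered_llr_statistic(X, Y, rho) > threshold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for dependent in (False, True):
        X, Y = sample_databases(n=50, d=200, rho=0.8, dependent=dependent, rng=rng)
        print(dependent, centered_llr_statistic(X, Y, 0.8), decide_dependent(X, Y, 0.8))
```

Because the statistic sums over every row pair, it is invariant to any reordering of the rows of `Y`. Under this Gaussian assumption its mean is zero under the null and $n d \rho^2 / (1-\rho^2)$ under the alternative, which is the reason for the illustrative threshold at half that value.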
About the journal:
The IEEE Transactions on Information Theory is a journal that publishes theoretical and experimental papers concerned with the transmission, processing, and utilization of information. The boundaries of acceptable subject matter are intentionally not sharply delimited. Rather, it is hoped that as the focus of research activity changes, a flexible policy will permit this Transactions to follow suit. Current appropriate topics are best reflected by recent Tables of Contents; they are summarized in the titles of editorial areas that appear on the inside front cover.