Is merging worth it? Securely evaluating the information gain for causal dataset acquisition
Jake Fawkes, Lucile Ter-Minassian, Desi Ivanova, Uri Shalit, Chris Holmes
arXiv - STAT - Machine Learning, 2024-09-11
DOI: arxiv-2409.07215 (https://doi.org/arxiv-2409.07215)
Abstract
Merging datasets across institutions is a lengthy and costly procedure,
especially when it involves private information. Data hosts may therefore want
to prospectively gauge which datasets are most beneficial to merge with,
without revealing sensitive information. For causal estimation, this is
particularly challenging because the value of a merge depends not only on the
reduction in epistemic uncertainty but also on the improvement in overlap. To
address this challenge, we introduce the first cryptographically secure
information-theoretic approach for quantifying the value of a merge in the
context of heterogeneous treatment effect estimation. We do this by evaluating
the Expected Information Gain (EIG) and using multi-party computation so that
it can be computed securely without revealing any raw data. As we
demonstrate, this can be combined with differential privacy (DP) to meet privacy
requirements whilst retaining more accurate computation than naive DP alone.
To the best of our knowledge, this work presents the first privacy-preserving
method for dataset acquisition tailored to causal estimation. We demonstrate
the effectiveness and reliability of our method on a range of simulated and
realistic benchmarks. The code is available anonymously.
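
To make the notion of Expected Information Gain concrete, the following is a minimal sketch under assumptions that the abstract does not state: a Bayesian linear model with Gaussian prior and known noise variance, for which the EIG about the weights has the closed form 0.5 * log det(I + Sigma X^T X / sigma^2). The function name, the data, and the model choice are all hypothetical illustrations and should not be read as the authors' estimator or protocol. In this toy model the candidate dataset enters only through the Gram matrix X^T X, an aggregate over rows rather than the rows themselves.

```python
import numpy as np

def eig_linear_gaussian(X_new, prior_cov, noise_var):
    """EIG (in nats) about the weights w of y = X w + eps, with
    eps ~ N(0, noise_var * I) and w ~ N(0, prior_cov), from observing
    outcomes at the rows of X_new. Hypothetical helper for illustration."""
    d = prior_cov.shape[0]
    gram = X_new.T @ X_new  # d x d aggregate of the candidate's covariates
    _, logdet = np.linalg.slogdet(np.eye(d) + prior_cov @ gram / noise_var)
    return 0.5 * logdet

# Synthetic stand-ins for a host dataset and a candidate dataset (assumed).
rng = np.random.default_rng(0)
d = 3
noise_var = 1.0
prior_cov = np.eye(d)
X_host = rng.normal(size=(50, d))
X_candidate = rng.normal(size=(200, d))

# The host's posterior covariance after its own data serves as the prior
# when scoring what the candidate dataset would add.
host_post_cov = np.linalg.inv(
    np.linalg.inv(prior_cov) + X_host.T @ X_host / noise_var
)
gain = eig_linear_gaussian(X_candidate, host_post_cov, noise_var)
print(f"EIG from merging candidate data: {gain:.3f} nats")
```

Comparing `gain` across several candidate datasets would give a ranking of merges in this simplified setting; the secure computation and differential privacy components described in the abstract are not modelled here.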