Differentially Private Synthetic High-dimensional Tabular Stream
Girish Kumar, Thomas Strohmer, Roman Vershynin
arXiv:2409.00322 (arXiv - STAT - Statistics Theory), published 2024-08-31
Abstract
While differentially private synthetic data generation has been explored
extensively in the literature, how to update the synthetic data when the
underlying private data changes is much less understood. We propose an
algorithmic framework for streaming data that generates multiple synthetic
datasets over time, tracking changes in the underlying private data. Our
algorithm satisfies differential privacy for the entire input stream (continual
differential privacy) and can be used for high-dimensional tabular data.
Furthermore, we show the utility of our method via experiments on real-world
datasets. The proposed algorithm builds upon a popular select, measure, fit,
and iterate paradigm (used by offline synthetic data generation algorithms) and
private counters for streams.
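
To make the "private counters for streams" building block concrete, here is a minimal Python sketch of one classical construction, the binary (tree) mechanism for continual counting. The abstract does not say which counter the paper uses, so this choice is an assumption, and the names `BinaryMechanismCounter` and `private_prefix_counts` are hypothetical. It assumes each stream item changes the count by at most 1 and that the stream length (`horizon`) is known in advance.

```python
import random


class BinaryMechanismCounter:
    """Illustrative continual counter (binary/tree mechanism), not the paper's code.

    Assumes each item x satisfies |x| <= 1 and at most `horizon` items arrive.
    """

    def __init__(self, horizon, epsilon, seed=None):
        if horizon < 1 or epsilon <= 0:
            raise ValueError("horizon must be >= 1 and epsilon must be > 0")
        self.horizon = horizon
        # Each item contributes to at most one partial sum per tree level.
        self.levels = int(horizon).bit_length()
        # Splitting epsilon evenly across levels gives this Laplace scale.
        self.scale = self.levels / epsilon
        self.rng = random.Random(seed)
        self.exact = [0.0] * self.levels  # exact partial sums, one per level
        self.noisy = [0.0] * self.levels  # their noisy, releasable versions
        self.t = 0

    def _laplace(self):
        # The difference of two i.i.d. exponentials with mean `scale`
        # is Laplace-distributed with location 0 and that scale.
        rate = 1.0 / self.scale
        return self.rng.expovariate(rate) - self.rng.expovariate(rate)

    def update(self, x):
        """Consume one item and return a noisy estimate of the running sum."""
        if self.t >= self.horizon:
            raise RuntimeError("stream is longer than the declared horizon")
        self.t += 1
        # Index of the lowest set bit of t: the level that closes at time t.
        i = (self.t & -self.t).bit_length() - 1
        # Merge all lower levels plus the new item into level i.
        self.exact[i] = sum(self.exact[:i]) + x
        for j in range(i):
            self.exact[j] = 0.0
            self.noisy[j] = 0.0
        self.noisy[i] = self.exact[i] + self._laplace()
        # The prefix sum is the sum of the levels named by the set bits of t.
        return sum(self.noisy[j] for j in range(self.levels) if (self.t >> j) & 1)


def private_prefix_counts(stream, epsilon, seed=None):
    """Return noisy running counts for a finite stream (hypothetical helper)."""
    counter = BinaryMechanismCounter(len(stream), epsilon, seed)
    return [counter.update(x) for x in stream]
```

Calling `private_prefix_counts([1, 0, 1, 1], epsilon=1.0)` returns one noisy running count per time step. Each estimate sums at most about log2(T) noisy terms, so the error grows polylogarithmically with the stream length T rather than linearly. The floating-point Laplace sampler here is for illustration only and is not hardened for production privacy guarantees.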