Local Hashing and Fake Data for Privacy-Aware Frequency Estimation

2023 17th International Conference on Ubiquitous Information Management and Communication (IMCOM). Pub Date: 2023-01-03. DOI: 10.1109/IMCOM56909.2023.10035583

Gatha Varma


Abstract

Data collected from users of services and applications contain identifying attributes. Categorical attributes of user data take values from a fixed set of domain values $\boldsymbol{D}_{\boldsymbol{m}}$. Statistical analysis of the collected data drives modeling, which for categorical attributes means frequency estimation: approximating the number of individuals who reported a specific value from $\boldsymbol{D}_{\boldsymbol{m}}$. When user data are collected repeatedly, frequency estimation may pose disclosure risks. It is therefore important to privatize user data so that the statistics remain useful while privacy risks are minimized. This is achieved by a set of algorithms called frequency oracles. Local Differential Privacy (LDP) is a widely used technique in such settings, and several methods, including sampling and randomization, are used to amplify its privacy guarantees. In this paper, I propose the first sample-based frequency oracle that uses Optimized Local Hashing (OLH) and is further enhanced by replacing some attribute values with fake data. The adaptive solution exploits the benefits OLH offers for datasets of large dimensionality and its variance, which is independent of dimensionality. Its privacy-utility trade-off was found to be better than that of existing solutions in certain general and strict privacy regimes for multi-dimensional datasets.
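To make the role of a frequency oracle concrete, the following is a minimal Python sketch of plain Optimized Local Hashing for a single categorical attribute, following the standard description of OLH in the LDP literature. It is not the oracle proposed in the paper: the sampling and fake-data steps are not shown, and the names `olh_perturb`, `olh_estimate`, and `_bucket`, as well as the use of a seeded BLAKE2b digest as the hash family, are assumptions made for illustration.

```python
import hashlib
import math
import random


def _num_buckets(epsilon: float) -> int:
    # Standard OLH choice of hash range: g = e^epsilon + 1, rounded (at least 2).
    return max(2, round(math.exp(epsilon) + 1))


def _bucket(seed: int, value: int, g: int) -> int:
    # Illustrative hash family: a seeded BLAKE2b digest reduced modulo g.
    digest = hashlib.blake2b(f"{seed}:{value}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % g


def olh_perturb(value: int, epsilon: float, rng: random.Random) -> tuple[int, int]:
    """Client side: hash the value, then randomize the bucket index."""
    g = _num_buckets(epsilon)
    seed = rng.getrandbits(32)          # identifies this user's hash function
    x = _bucket(seed, value, g)
    p = math.exp(epsilon) / (math.exp(epsilon) + g - 1)
    if rng.random() >= p:
        # With probability 1 - p, report one of the other g - 1 buckets uniformly.
        y = rng.randrange(g - 1)
        x = y + 1 if y >= x else y
    return seed, x


def olh_estimate(reports, domain, epsilon: float) -> dict[int, float]:
    """Server side: unbiased count estimate for every value in the domain."""
    g = _num_buckets(epsilon)
    n = len(reports)
    p = math.exp(epsilon) / (math.exp(epsilon) + g - 1)
    estimates = {}
    for v in domain:
        # A report "supports" v if v hashes to the reported bucket.
        support = sum(1 for seed, y in reports if _bucket(seed, v, g) == y)
        # A true holder supports v with probability p, anyone else with 1/g.
        estimates[v] = (support - n / g) / (p - 1 / g)
    return estimates
```

A caller would perturb each user's value with `olh_perturb(value, epsilon, rng)`, collect the resulting `(seed, bucket)` pairs, and pass them to `olh_estimate` together with the domain. Each report consists only of a hash seed and a bucket index, so the noise in the estimate depends on the privacy budget and the number of users rather than on the domain size.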


