Local Hashing and Fake Data for Privacy-Aware Frequency Estimation
Gatha Varma
2023 17th International Conference on Ubiquitous Information Management and Communication (IMCOM), published 2023-01-03
DOI: 10.1109/IMCOM56909.2023.10035583 (https://doi.org/10.1109/IMCOM56909.2023.10035583)
Citation count: 0
Abstract
Data collected from the users of services and applications contain identifying attributes. Categorical attributes of user data take values from a fixed domain $\boldsymbol{D}_{\boldsymbol{m}}$. Statistical analysis of the collected data drives modeling, which for categorical attributes means frequency estimation: approximating the number of individuals who reported a specific value from $\boldsymbol{D}_{\boldsymbol{m}}$. When user data are collected repeatedly, frequency estimation may pose potential disclosure risks. It is therefore important to privatize the user data so that the statistics remain relevant while privacy risks are minimized. This is achieved by a set of algorithms called Frequency Oracles. Local Differential Privacy is a widely used technique in such settings, and several methods, including sampling and randomization, are used to amplify its privacy guarantees. In this paper, I propose the first sample-based frequency oracle that uses Optimized Local Hashing (OLH) and is further enhanced by replacing some attribute values with fake data. The adaptive solution exploits the benefits that OLH offers for large-dimensional datasets, along with a variance that is independent of dimensionality. The privacy-utility trade-off of the proposed solution was found to be better than that of existing solutions in certain general and strict privacy regimes for multi-dimensional datasets.
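For readers unfamiliar with frequency oracles, the following is a minimal sketch of a generic OLH oracle for a single categorical attribute. It is not the sample-based, fake-data variant proposed in the paper, whose details the abstract does not give. It assumes the commonly used OLH parameters $g = \lceil e^{\varepsilon} \rfloor + 1$ and $p = e^{\varepsilon} / (e^{\varepsilon} + g - 1)$; the function names (`hash_to_bucket`, `olh_params`, `olh_perturb`, `olh_estimate`) and the use of BLAKE2b as the hash family are illustrative assumptions.

```python
import hashlib
import math
import random


def hash_to_bucket(seed, value, g):
    """Map (seed, value) to a bucket in [0, g); each seed selects one hash function."""
    digest = hashlib.blake2b(f"{seed}:{value}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % g


def olh_params(epsilon):
    """Return (g, p): number of hash buckets and probability of reporting truthfully."""
    g = int(round(math.exp(epsilon))) + 1
    p = math.exp(epsilon) / (math.exp(epsilon) + g - 1)
    return g, p


def olh_perturb(value, epsilon, rng):
    """Client side: hash the value with a fresh seed, then randomize the bucket."""
    g, p = olh_params(epsilon)
    seed = rng.getrandbits(32)
    bucket = hash_to_bucket(seed, value, g)
    if rng.random() >= p:
        # Report one of the other g - 1 buckets, chosen uniformly.
        other = rng.randrange(g - 1)
        bucket = other if other < bucket else other + 1
    return seed, bucket


def olh_estimate(reports, domain, epsilon):
    """Server side: unbiased count estimate for every value in the domain."""
    g, p = olh_params(epsilon)
    n = len(reports)
    estimates = {}
    for v in domain:
        # A report "supports" v when v hashes to the reported bucket.
        support = sum(1 for seed, b in reports if hash_to_bucket(seed, v, g) == b)
        estimates[v] = (support - n / g) / (p - 1 / g)
    return estimates
```

Each user would call `olh_perturb` once on their own value and send only the `(seed, bucket)` pair; the server then calls `olh_estimate` on the collected reports. The estimates are noisy and can be negative for rare values, which is expected for an unbiased estimator of this kind.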