Title: Proximity Test for Sensitive Categorical Attributes in Big Data
Authors: Zakariae El Ouazzani, H. Bakkali
Venue: 2018 4th International Conference on Cloud Computing Technologies and Applications (Cloudtech)
Publication date: 2018-11-01
DOI: 10.1109/CloudTech.2018.8713359 (https://doi.org/10.1109/CloudTech.2018.8713359)
Citation count: 3
Abstract
Organizations today collect and store huge amounts of data in large data sets for research and mining purposes. The collected data are useful only if they are published or shared between companies. However, these data contain individuals' sensitive information, so ensuring privacy in big data has become a significant issue. Privacy protection aims to shield this private information from the various privacy threats that may expose an individual's identity. Anonymization techniques have therefore become a subject of research and must be applied before the data set is transmitted to other organizations. They offer a way to ensure privacy in mixed data sets containing both numerical and categorical attributes. Several works build on the idea of horizontal clustering; the l-diversity technique is one of them, and it handles both numerical and categorical sensitive attributes. Although l-diversity groups only distinct sensitive values into each bucket, those distinct values may, after anonymization, still correspond to a single specific category. In this paper, a new method called “Proximity test for sensitive categorical attributes” is proposed to deal with non-numerical attributes. The proposed algorithm tests the degree of proximity between the values within each bucket of the data set, and it does so without relying on any threshold. The algorithm is implemented and evaluated on a test table, and all of its steps are presented with detailed comments.
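The weakness of l-diversity described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm, whose steps the abstract does not give. The taxonomy `CATEGORY_OF`, the sample buckets, and all function names are hypothetical and chosen only for illustration. The sketch shows a bucket that passes a distinct-values check while every value still falls under one parent category.

```python
from collections import Counter

# Hypothetical taxonomy (an assumption, not taken from the paper):
# each sensitive categorical value maps to a parent category.
CATEGORY_OF = {
    "gastric ulcer": "stomach disease",
    "gastritis": "stomach disease",
    "stomach cancer": "stomach disease",
    "flu": "respiratory disease",
    "bronchitis": "respiratory disease",
}


def is_distinct_l_diverse(bucket, l):
    """Return True if the bucket holds at least l distinct sensitive values."""
    return len(set(bucket)) >= l


def category_counts(bucket):
    """Count how many records in the bucket fall under each parent category."""
    return Counter(CATEGORY_OF[value] for value in bucket)


def shares_single_category(bucket):
    """Return True if every value in the bucket has the same parent category."""
    return len(category_counts(bucket)) == 1


# Illustrative buckets (made-up data, not the paper's test table).
bucket_a = ["gastric ulcer", "gastritis", "stomach cancer"]
bucket_b = ["gastritis", "flu", "bronchitis"]

for name, bucket in (("bucket_a", bucket_a), ("bucket_b", bucket_b)):
    print(
        name,
        "distinct 3-diverse:", is_distinct_l_diverse(bucket, 3),
        "| single category:", shares_single_category(bucket),
    )
```

Here `bucket_a` contains three distinct values, yet an observer still learns that every individual in it has a stomach disease. A proximity test over categorical values would aim to detect this kind of leakage, which a distinct-values check alone cannot see.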