Proximity Test for Sensitive Categorical Attributes in Big Data

2018 4th International Conference on Cloud Computing Technologies and Applications (Cloudtech). Pub Date: 2018-11-01. DOI: 10.1109/CloudTech.2018.8713359

Zakariae El Ouazzani, H. Bakkali


Citation count: 3

Abstract

Organizations today collect and store huge amounts of data in large data sets for research and mining purposes. The collected data are useful only if they are published or shared between companies. However, these data contain individuals' sensitive information, so ensuring privacy in big data has become a significant issue. Privacy protection aims to shield this private information from the various threats that may expose an individual's identity. Anonymization techniques have therefore become a subject of research and must be applied before a data set is transmitted to other organizations. They offer a way to ensure privacy in mixed data sets containing both numerical and categorical attributes. Several works build on the idea of horizontal clustering; l-diversity is one of them, and it handles both numerical and categorical sensitive attributes. Although l-diversity is applied to a data set by placing only distinct values into diverse buckets, those distinct values may still correspond, after the anonymization process, to one specific category. In this paper, a new method called “Proximity test for sensitive categorical attributes” is proposed to deal with non-numerical attributes. The proposed algorithm tests the degree of proximity between values within each bucket in the data set, and it works without relying on any threshold. The algorithm is implemented and evaluated on a test table, and all of its steps are presented with detailed comments.
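
The weakness described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm, whose steps are not given here; it only shows how a bucket can hold distinct sensitive values that all fall under one category. The taxonomy `CATEGORY_OF`, the bucket contents, and the helper names are hypothetical and chosen purely for illustration.

```python
from collections import Counter

# Hypothetical taxonomy mapping each sensitive value to a broader category.
# This mapping is an illustrative assumption, not taken from the paper.
CATEGORY_OF = {
    "gastric ulcer": "stomach disease",
    "gastritis": "stomach disease",
    "stomach cancer": "stomach disease",
    "flu": "respiratory disease",
    "bronchitis": "respiratory disease",
}


def is_distinct_l_diverse(bucket, l):
    """Return True if the bucket holds at least l distinct sensitive values."""
    return len(set(bucket)) >= l


def category_counts(bucket):
    """Count how many distinct values in the bucket fall under each category."""
    return Counter(CATEGORY_OF[value] for value in set(bucket))


# Hypothetical buckets of sensitive values produced by an anonymization step.
buckets = {
    "bucket_1": ["gastric ulcer", "gastritis", "stomach cancer"],
    "bucket_2": ["flu", "gastritis", "bronchitis"],
}

for name, values in buckets.items():
    # Report the diversity check and the per-category spread; no threshold is applied.
    print(name, is_distinct_l_diverse(values, 3), dict(category_counts(values)))
```

In this sketch both buckets pass the distinct 3-diversity check, yet every value in `bucket_1` maps to the same category, so an observer would still learn the category of each individual in that bucket. A proximity test over categorical values is meant to surface this kind of case.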


