Salheddine Kabou , Laid Gasmi , Abdelbaset Kabou , Sidi Mohammed Benslimane
{"title":"ImDMI: Improved Distributed M-Invariance model to achieve privacy continuous big data publishing using Apache Spark","authors":"Salheddine Kabou , Laid Gasmi , Abdelbaset Kabou , Sidi Mohammed Benslimane","doi":"10.1016/j.bdr.2025.100519","DOIUrl":null,"url":null,"abstract":"<div><div>One of the critical challenges in the big data analytics is the individual's privacy issues. Data anonymization models including k-anonymity and l-diversity are used to guarantee the tradeoff between privacy and data utility while publishing the data. However, these models focus only on the single release of datasets and produce a certain level of privacy. In practical big data applications, data publishing is more complicated where the data is published continuously as new data is collected, and the privacy should be achieved for different releases. In this research, we propose a new distributed bottom up approach on Apache Spark for achievement of the m-invariance privacy model in the continuous big data context. The proposed approach, which is the first study that deals with dynamic big data publishing, is based on the insertion and the split process. In the first process, the data records collected from different workers are inserted into an improved bottom up R-tree generalization in order to minimizing the information loss. The second process concentrates on splitting the overflowed node with respect to the m-invariance model requirement by minimizing the overlap between the resulting partitions. The experimental results show significant improvement in term of data utility, execution time and counterfeit data records as compared to existing techniques in the literature.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"40 ","pages":"Article 100519"},"PeriodicalIF":3.5000,"publicationDate":"2025-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Big Data Research","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2214579625000140","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
One of the critical challenges in the big data analytics is the individual's privacy issues. Data anonymization models including k-anonymity and l-diversity are used to guarantee the tradeoff between privacy and data utility while publishing the data. However, these models focus only on the single release of datasets and produce a certain level of privacy. In practical big data applications, data publishing is more complicated where the data is published continuously as new data is collected, and the privacy should be achieved for different releases. In this research, we propose a new distributed bottom up approach on Apache Spark for achievement of the m-invariance privacy model in the continuous big data context. The proposed approach, which is the first study that deals with dynamic big data publishing, is based on the insertion and the split process. In the first process, the data records collected from different workers are inserted into an improved bottom up R-tree generalization in order to minimizing the information loss. The second process concentrates on splitting the overflowed node with respect to the m-invariance model requirement by minimizing the overlap between the resulting partitions. The experimental results show significant improvement in term of data utility, execution time and counterfeit data records as compared to existing techniques in the literature.
期刊介绍:
The journal aims to promote and communicate advances in big data research by providing a fast and high quality forum for researchers, practitioners and policy makers from the very many different communities working on, and with, this topic.
The journal will accept papers on foundational aspects in dealing with big data, as well as papers on specific Platforms and Technologies used to deal with big data. To promote Data Science and interdisciplinary collaboration between fields, and to showcase the benefits of data driven research, papers demonstrating applications of big data in domains as diverse as Geoscience, Social Web, Finance, e-Commerce, Health Care, Environment and Climate, Physics and Astronomy, Chemistry, life sciences and drug discovery, digital libraries and scientific publications, security and government will also be considered. Occasionally the journal may publish whitepapers on policies, standards and best practices.