A hybrid anonymization pipeline to improve the privacy-utility balance in sensitive datasets for ML purposes

Jenno Verdonck, Kevin De Boeck, M. Willocx, Jorn Lapon, Vincent Naessens
DOI: 10.1145/3600160.3600168
Published in: Proceedings of the 18th International Conference on Availability, Reliability and Security
Publication date: 2023-08-29

Abstract

The modern world is data-driven. Businesses increasingly make strategic decisions based on customer data, and companies are founded with the sole focus of performing machine-learning-driven data analytics for third parties. External data sources containing sensitive records are often required to build high-quality machine learning models and, hence, to make accurate and meaningful predictions. However, exchanging sensitive datasets is no trivial undertaking. Personal data must be managed in accordance with privacy regulations. Similarly, the loss of strategic data can harm a company's competitiveness. In both cases, dataset anonymization can overcome the aforementioned obstacles. This work proposes a hybrid anonymization pipeline combining masking and (intelligent) sampling to improve the privacy-utility balance of anonymized datasets. The approach is validated via in-depth experiments on a representative machine learning scenario. A quantitative privacy assessment of the proposed hybrid anonymization pipeline is performed, relying on two well-known privacy metrics, namely re-identification risk and certainty. Furthermore, this work shows that the utility level of the anonymized dataset remains acceptable, and that the overall privacy-utility balance increases when masking is complemented with intelligent sampling. The study further dispels the common misconception that dataset anonymization is detrimental to the quality of machine learning models. The empirical study shows that anonymous datasets – generated by the hybrid anonymization pipeline – can compete with the original (identifiable) ones when used as input for training a machine learning model.
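The two stages named in the abstract – masking followed by sampling, scored with a re-identification-risk metric – can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the paper's actual method: the column names, generalization rules, and plain random sampling (standing in for the paper's "intelligent" sampling) are all hypothetical, and re-identification risk is approximated here as the average of 1/|equivalence class| over the quasi-identifiers.

```python
import pandas as pd

def mask(df: pd.DataFrame) -> pd.DataFrame:
    """Masking step: generalize quasi-identifiers (illustrative rules)."""
    out = df.copy()
    # Coarsen age into 10-year bands and truncate zip codes to 3 digits.
    out["age"] = (out["age"] // 10) * 10
    out["zip"] = out["zip"].astype(str).str[:3] + "**"
    return out

def reidentification_risk(df: pd.DataFrame, quasi_ids: list[str]) -> float:
    """Mean probability of singling out a record: average of 1/|equivalence class|."""
    sizes = df.groupby(quasi_ids)[quasi_ids[0]].transform("size")
    return float((1.0 / sizes).mean())

def pipeline(df: pd.DataFrame, frac: float = 0.5, seed: int = 0) -> pd.DataFrame:
    """Hybrid pipeline sketch: mask, then sample a fraction of the records."""
    masked = mask(df)
    # Plain random sampling stands in for the paper's intelligent sampling.
    return masked.sample(frac=frac, random_state=seed)

df = pd.DataFrame({
    "age":    [23, 27, 34, 36, 45, 47, 52, 58],
    "zip":    [10001, 10002, 10003, 10004, 20001, 20002, 20003, 20004],
    "income": [40, 42, 55, 60, 70, 72, 80, 85],
})
anon = pipeline(df)
print(reidentification_risk(df, ["age", "zip"]))    # every record unique: 1.0
print(reidentification_risk(anon, ["age", "zip"]))  # lower after masking
```

On this toy table, masking alone merges each record into an equivalence class of two, halving the risk; sampling then lowers an attacker's certainty further, since a matching record may simply be absent from the released subset, which is the intuition behind combining the two stages.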