Privacy-Preserving Machine Learning Algorithms for Big Data Systems

2015 IEEE 35th International Conference on Distributed Computing Systems. Pub Date: 2015-06-01. DOI: 10.1109/ICDCS.2015.40

Kaihe Xu, Hao Yue, Linke Guo, Yuanxiong Guo, Yuguang Fang


Citations: 76

Abstract

Machine learning plays an increasingly important role in big data systems because it can efficiently discover valuable knowledge and hidden information. Big data systems, such as healthcare or financial systems, often involve multiple organizations that have different privacy policies and may not share their data publicly, even though joint data processing may be required. How to share big data among distributed data processing entities while mitigating privacy concerns therefore becomes a challenging problem. Traditional methods rely on cryptographic tools and/or randomization to preserve privacy. Unfortunately, these methods alone may be inadequate for emerging big data systems because they were designed mainly for traditional small-scale data sets. In this paper, we propose a novel framework for privacy-preserving machine learning in which the training data are distributed and each shared data portion is of large volume. Specifically, we exploit the data locality property of the Apache Hadoop architecture and perform only a limited number of cryptographic operations in the Reduce() procedures to achieve privacy preservation. We show that the proposed scheme is secure in the semi-honest model and use extensive simulations to demonstrate its scalability and correctness.
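The abstract does not spell out the protocol, so the following is only a minimal sketch of the general idea, not the authors' scheme. It assumes that each party first aggregates its own records locally (standing in for a data-local map step) and that the only cryptographic work happens when the per-party aggregates are combined (standing in for the reduce step), here using additive secret sharing as a swapped-in technique. The names `MODULUS`, `local_map`, `make_shares`, and `secure_reduce` are hypothetical, and the records are assumed to be non-negative integers.

```python
import secrets

# Hypothetical public modulus that all parties are assumed to agree on.
MODULUS = 2**61 - 1


def local_map(records):
    """Map-like step: a party aggregates its own records locally.

    Raw records never leave the party; only this single aggregate is
    passed on to the combining step.
    """
    return sum(records) % MODULUS


def make_shares(value, n_parties):
    """Split `value` into n additive shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def secure_reduce(local_values):
    """Reduce-like step: combine aggregates without revealing any one of them.

    Party i sends share j of its aggregate to party j. Each party then
    publishes only the sum of the shares it received, so the individual
    aggregates stay hidden while the global total is still recovered.
    """
    n = len(local_values)
    all_shares = [make_shares(v, n) for v in local_values]
    partials = [
        sum(all_shares[i][j] for i in range(n)) % MODULUS for j in range(n)
    ]
    return sum(partials) % MODULUS


# Three hypothetical organizations, each holding private integer records.
party_data = [[3, 5, 7], [10, 20], [1, 1, 1, 1]]
local_values = [local_map(d) for d in party_data]
total = secure_reduce(local_values)
print(total)
```

In this sketch the number of cryptographic operations grows with the number of parties rather than with the number of records, which is one way to read the abstract's "limited number of cryptographic operations" at the reduce stage.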
