
Proceedings of the Third Workshop on Privacy in Natural Language Processing: Latest Publications

Understanding Unintended Memorization in Language Models Under Federated Learning
Pub Date: 2021-06-01 | DOI: 10.18653/V1/2021.PRIVATENLP-1.1
Om Thakkar, Swaroop Indra Ramaswamy, Rajiv Mathews, F. Beaufays
Recent works have shown that language models (LMs), e.g., for next word prediction (NWP), have a tendency to memorize rare or unique sequences in the training data. Since useful LMs are often trained on sensitive data, it is critical to identify and mitigate such unintended memorization. Federated Learning (FL) has emerged as a novel framework for large-scale distributed learning tasks. It differs in many aspects from the well-studied central learning setting where all the data is stored at the central server, and minibatch stochastic gradient descent is used to conduct training. This work is motivated by our observation that NWP models trained under FL exhibited remarkably less propensity to such memorization compared to the central learning setting. Thus, we initiate a formal study to understand the effect of different components of FL on unintended memorization in trained NWP models. Our results show that several differing components of FL play an important role in reducing unintended memorization. First, we discover that the clustering of data according to users—which happens by design in FL—has the most significant effect in reducing such memorization. Using the Federated Averaging optimizer with larger effective minibatch sizes for training causes a further reduction. We also demonstrate that training in FL with a user-level differential privacy guarantee results in models that can provide high utility while being resilient to memorizing out-of-distribution phrases with thousands of insertions across over a hundred users in the training set.
{"title":"Understanding Unintended Memorization in Language Models Under Federated Learning","authors":"Om Thakkar, Swaroop Indra Ramaswamy, Rajiv Mathews, F. Beaufays","doi":"10.18653/V1/2021.PRIVATENLP-1.1","DOIUrl":"https://doi.org/10.18653/V1/2021.PRIVATENLP-1.1","url":null,"abstract":"Recent works have shown that language models (LMs), e.g., for next word prediction (NWP), have a tendency to memorize rare or unique sequences in the training data. Since useful LMs are often trained on sensitive data, it is critical to identify and mitigate such unintended memorization. Federated Learning (FL) has emerged as a novel framework for large-scale distributed learning tasks. It differs in many aspects from the well-studied central learning setting where all the data is stored at the central server, and minibatch stochastic gradient descent is used to conduct training. This work is motivated by our observation that NWP models trained under FL exhibited remarkably less propensity to such memorization compared to the central learning setting. Thus, we initiate a formal study to understand the effect of different components of FL on unintended memorization in trained NWP models. Our results show that several differing components of FL play an important role in reducing unintended memorization. First, we discover that the clustering of data according to users—which happens by design in FL—has the most significant effect in reducing such memorization. Using the Federated Averaging optimizer with larger effective minibatch sizes for training causes a further reduction. We also demonstrate that training in FL with a user-level differential privacy guarantee results in models that can provide high utility while being resilient to memorizing out-of-distribution phrases with thousands of insertions across over a hundred users in the training set.","PeriodicalId":270632,"journal":{"name":"Proceedings of the Third Workshop on Privacy in Natural Language Processing","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122990122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 23
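To make the mechanisms the abstract names concrete, below is a minimal NumPy sketch of DP-FedAvg-style training under stated assumptions; it is not the authors' code. Data is partitioned per user, each user runs local SGD and returns a weight delta, the server clips each delta to a fixed norm, averages the deltas, and adds Gaussian noise calibrated to the per-user clip norm, which is what makes the privacy accounting user-level. The toy linear model and every name here (local_sgd, dp_fedavg_round, clip, noise_mult) are hypothetical stand-ins, not the paper's next-word-prediction LM.

```python
# Minimal sketch, assuming a toy linear model in place of a real NWP language model.
# Ingredients mirrored from the abstract: per-user data clustering, Federated
# Averaging, and user-level DP via per-user update clipping + Gaussian noise.
import numpy as np

rng = np.random.default_rng(0)

def local_sgd(weights, user_data, lr=0.1, epochs=1):
    """Local training on one user's (x, y) pairs; returns the user's weight delta."""
    w = weights.copy()
    for _ in range(epochs):
        for x, y in user_data:
            grad = 2.0 * (w @ x - y) * x      # gradient of the squared error (w.x - y)^2
            w -= lr * grad
    return w - weights

def dp_fedavg_round(weights, users, clip=1.0, noise_mult=1.0):
    """One FedAvg round with user-level DP: clip each user's delta, average, add noise."""
    deltas = []
    for user_data in users:
        d = local_sgd(weights, user_data)
        d *= min(1.0, clip / (np.linalg.norm(d) + 1e-12))   # per-user clipping
        deltas.append(d)
    avg = np.mean(deltas, axis=0)
    # Noise is scaled to the per-user clip norm and the number of participating users,
    # so each user's entire contribution (not each example) is what gets protected.
    noise = rng.normal(0.0, noise_mult * clip / len(users), size=avg.shape)
    return weights + avg + noise

# Toy usage: 4 "users", each holding a small cluster of examples, 10 communication rounds.
dim = 8
users = [[(rng.normal(size=dim), rng.normal()) for _ in range(5)] for _ in range(4)]
w = np.zeros(dim)
for _ in range(10):
    w = dp_fedavg_round(w, users)
```

In this sketch, averaging more users per round is the analogue of the larger effective minibatch size that the abstract credits with further reducing unintended memorization.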
An Investigation towards Differentially Private Sequence Tagging in a Federated Framework
Pub Date: 2021-06-01 | DOI: 10.18653/V1/2021.PRIVATENLP-1.4
Abhik Jana, Chris Biemann
To build machine learning-based applications for sensitive domains such as medicine and law, where the digitized text contains private information, the text must be anonymized to preserve privacy. Sequence tagging, e.g. as done in Named Entity Recognition (NER), can help detect private information. However, training sequence tagging models requires a sufficient amount of labeled data, and in privacy-sensitive domains such labeled data cannot be shared directly either. In this paper, we investigate the applicability of a privacy-preserving framework for sequence tagging tasks, specifically NER. We analyze a framework for the NER task that incorporates two levels of privacy protection. First, we deploy a federated learning (FL) framework in which the labeled data are shared neither with the centralized server nor with the peer clients. Second, we apply differential privacy (DP) while the models are being trained on each client instance. While both privacy measures are suitable for privacy-aware models, their combination results in unstable models. To our knowledge, this is the first study of its kind on privacy-aware sequence tagging models.
{"title":"An Investigation towards Differentially Private Sequence Tagging in a Federated Framework","authors":"Abhik Jana, Chris Biemann","doi":"10.18653/V1/2021.PRIVATENLP-1.4","DOIUrl":"https://doi.org/10.18653/V1/2021.PRIVATENLP-1.4","url":null,"abstract":"To build machine learning-based applications for sensitive domains like medical, legal, etc. where the digitized text contains private information, anonymization of text is required for preserving privacy. Sequence tagging, e.g. as done in Named Entity Recognition (NER) can help to detect private information. However, to train sequence tagging models, a sufficient amount of labeled data are required but for privacy-sensitive domains, such labeled data also can not be shared directly. In this paper, we investigate the applicability of a privacy-preserving framework for sequence tagging tasks, specifically NER. Hence, we analyze a framework for the NER task, which incorporates two levels of privacy protection. Firstly, we deploy a federated learning (FL) framework where the labeled data are not shared with the centralized server as well as the peer clients. Secondly, we apply differential privacy (DP) while the models are being trained in each client instance. While both privacy measures are suitable for privacy-aware models, their combination results in unstable models. To our knowledge, this is the first study of its kind on privacy-aware sequence tagging models.","PeriodicalId":270632,"journal":{"name":"Proceedings of the Third Workshop on Privacy in Natural Language Processing","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115796609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
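As a companion illustration of the two protection levels this abstract combines, here is a minimal sketch under assumptions, not the paper's implementation: labeled data stays on each client, each client trains locally with DP-SGD (per-example gradient clipping plus Gaussian noise), and the server only averages the resulting weight deltas. The linear per-token "tagger" and all names (dp_sgd_step, client_round, etc.) are hypothetical placeholders for a real NER model.

```python
# Minimal sketch: federated training of a toy linear per-token tag classifier,
# with DP-SGD applied inside each client, then FedAvg on the server.
import numpy as np

rng = np.random.default_rng(1)
NUM_TAGS, FEAT_DIM = 3, 16   # e.g. {O, PER, LOC} over hashed token features (assumed)

def dp_sgd_step(W, batch, lr=0.05, clip=1.0, noise_mult=1.0):
    """One DP-SGD step: clip each per-example gradient, sum, add noise, average."""
    grads = []
    for x, tag in batch:
        logits = W @ x
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        probs[tag] -= 1.0                    # softmax cross-entropy gradient w.r.t. logits
        g = np.outer(probs, x)
        g *= min(1.0, clip / (np.linalg.norm(g) + 1e-12))   # per-example clipping
        grads.append(g)
    noise = rng.normal(0.0, noise_mult * clip, size=W.shape)
    return W - lr * (sum(grads) + noise) / len(batch)

def client_round(W_global, client_tokens, steps=5, batch_size=8):
    """A client trains locally with DP-SGD; only the weight delta leaves the device."""
    W = W_global.copy()
    for _ in range(steps):
        idx = rng.choice(len(client_tokens), size=batch_size, replace=False)
        W = dp_sgd_step(W, [client_tokens[i] for i in idx])
    return W - W_global

# Server side: FedAvg over the clients' DP-trained deltas; labels never leave the clients.
clients = [[(rng.normal(size=FEAT_DIM), int(rng.integers(NUM_TAGS))) for _ in range(40)]
           for _ in range(3)]
W = np.zeros((NUM_TAGS, FEAT_DIM))
for _ in range(5):
    W += np.mean([client_round(W, c) for c in clients], axis=0)
```

Note that noise is injected at every local step on every client; this per-step, per-client noise is the combination the abstract reports as producing unstable models.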