Zhou Yang;Zhipeng Zhao;Chenyu Wang;Jieke Shi;Dongsun Kim;DongGyun Han;David Lo
{"title":"找到了这个模型使用了我的代码评估代码模型中的成员泄漏风险","authors":"Zhou Yang;Zhipeng Zhao;Chenyu Wang;Jieke Shi;Dongsun Kim;DongGyun Han;David Lo","doi":"10.1109/TSE.2024.3482719","DOIUrl":null,"url":null,"abstract":"Leveraging large-scale datasets from open-source projects and advances in large language models, recent progress has led to sophisticated code models for key software engineering tasks, such as program repair and code completion. These models are trained on data from various sources, including public open-source projects like GitHub and private, confidential code from companies, raising significant privacy concerns. This paper investigates a crucial but unexplored question: \n<italic>What is the risk of membership information leakage in code models?</i>\n Membership leakage refers to the vulnerability where an attacker can infer whether a specific data point was part of the training dataset. We present \n<sc>Gotcha</small>\n, a novel membership inference attack method designed for code models, and evaluate its effectiveness on Java-based datasets. \n<sc>Gotcha</small>\n simultaneously considers three key factors: model input, model output, and ground truth. Our ablation study confirms that each factor significantly enhances attack performance. Our ablation study confirms that each factor significantly enhances attack performance. Our investigation reveals a troubling finding: \n<bold>membership leakage risk is significantly elevated</b>\n. While previous methods had accuracy close to random guessing, \n<sc>Gotcha</small>\n achieves high precision, with a true positive rate of 0.95 and a low false positive rate of 0.10. We also demonstrate that the attacker's knowledge of the victim model (e.g., model architecture and pre-training data) affects attack success. Additionally, modifying decoding strategies can help reduce membership leakage risks. This research highlights the urgent need to better understand the privacy vulnerabilities of code models and develop strong countermeasures against these threats.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"50 12","pages":"3290-3306"},"PeriodicalIF":6.5000,"publicationDate":"2024-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Gotcha! This Model Uses My Code! Evaluating Membership Leakage Risks in Code Models\",\"authors\":\"Zhou Yang;Zhipeng Zhao;Chenyu Wang;Jieke Shi;Dongsun Kim;DongGyun Han;David Lo\",\"doi\":\"10.1109/TSE.2024.3482719\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Leveraging large-scale datasets from open-source projects and advances in large language models, recent progress has led to sophisticated code models for key software engineering tasks, such as program repair and code completion. These models are trained on data from various sources, including public open-source projects like GitHub and private, confidential code from companies, raising significant privacy concerns. This paper investigates a crucial but unexplored question: \\n<italic>What is the risk of membership information leakage in code models?</i>\\n Membership leakage refers to the vulnerability where an attacker can infer whether a specific data point was part of the training dataset. We present \\n<sc>Gotcha</small>\\n, a novel membership inference attack method designed for code models, and evaluate its effectiveness on Java-based datasets. 
\\n<sc>Gotcha</small>\\n simultaneously considers three key factors: model input, model output, and ground truth. Our ablation study confirms that each factor significantly enhances attack performance. Our ablation study confirms that each factor significantly enhances attack performance. Our investigation reveals a troubling finding: \\n<bold>membership leakage risk is significantly elevated</b>\\n. While previous methods had accuracy close to random guessing, \\n<sc>Gotcha</small>\\n achieves high precision, with a true positive rate of 0.95 and a low false positive rate of 0.10. We also demonstrate that the attacker's knowledge of the victim model (e.g., model architecture and pre-training data) affects attack success. Additionally, modifying decoding strategies can help reduce membership leakage risks. This research highlights the urgent need to better understand the privacy vulnerabilities of code models and develop strong countermeasures against these threats.\",\"PeriodicalId\":13324,\"journal\":{\"name\":\"IEEE Transactions on Software Engineering\",\"volume\":\"50 12\",\"pages\":\"3290-3306\"},\"PeriodicalIF\":6.5000,\"publicationDate\":\"2024-10-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Software Engineering\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10735776/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Software Engineering","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10735776/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Gotcha! This Model Uses My Code! Evaluating Membership Leakage Risks in Code Models
Leveraging large-scale datasets from open-source projects and advances in large language models, recent progress has led to sophisticated code models for key software engineering tasks, such as program repair and code completion. These models are trained on data from various sources, including public open-source projects like GitHub and private, confidential code from companies, raising significant privacy concerns. This paper investigates a crucial but unexplored question: what is the risk of membership information leakage in code models? Membership leakage refers to the vulnerability where an attacker can infer whether a specific data point was part of the training dataset. We present Gotcha, a novel membership inference attack method designed for code models, and evaluate its effectiveness on Java-based datasets. Gotcha simultaneously considers three key factors: model input, model output, and ground truth. Our ablation study confirms that each factor significantly enhances attack performance. Our investigation reveals a troubling finding: membership leakage risk is significantly elevated. While previous methods achieved accuracy close to random guessing, Gotcha is far more effective, with a true positive rate of 0.95 and a false positive rate of 0.10. We also demonstrate that the attacker's knowledge of the victim model (e.g., its architecture and pre-training data) affects attack success, and that modifying decoding strategies can help reduce membership leakage risk. This research highlights the urgent need to better understand the privacy vulnerabilities of code models and to develop strong countermeasures against these threats.
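The abstract names the three signals Gotcha combines (model input, model output, and ground truth) without spelling out how an attacker might use them. The sketch below is a minimal, illustrative membership inference attack built from those signals; the victim model name (Salesforce/codegen-350M-mono), the two features (output-to-ground-truth similarity and per-example loss), and the logistic-regression membership classifier are assumptions made for illustration, not the paper's exact design.

# Minimal sketch of a membership inference attack on a code completion model,
# using the three signals named in the abstract: model input, model output,
# and ground truth. Model name, features, and classifier are illustrative
# assumptions, not the method described in the paper.
import torch
from difflib import SequenceMatcher
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "Salesforce/codegen-350M-mono"  # hypothetical victim model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def attack_features(prefix: str, ground_truth: str) -> list[float]:
    """Features derived from model input, model output, and ground truth."""
    inputs = tokenizer(prefix, return_tensors="pt")
    with torch.no_grad():
        generated = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    output = tokenizer.decode(
        generated[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    # Signal 1: how closely the model's completion matches the ground truth;
    # memorized training samples tend to be reproduced more faithfully.
    similarity = SequenceMatcher(None, output, ground_truth).ratio()
    # Signal 2: how confidently the model predicts the ground-truth tokens
    # (per-example loss); training members tend to receive lower loss.
    full = tokenizer(prefix + ground_truth, return_tensors="pt")
    with torch.no_grad():
        loss = model(**full, labels=full.input_ids).loss.item()
    return [similarity, loss]

# The attacker fits a binary classifier on examples whose membership status
# is already known (e.g., shadow data), then queries it on suspected targets:
#   X = [attack_features(p, gt) for p, gt in labeled_examples]
#   clf = LogisticRegression().fit(X, y)   # y: 1 = member, 0 = non-member
#   clf.predict([attack_features(target_prefix, target_completion)])

The abstract's observation that decoding strategies affect leakage would, in this sketch, correspond to changing the generate() arguments (for example, sampling with a higher temperature instead of greedy decoding), which weakens the output-similarity signal available to the attacker.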
About the journal:
IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include:
a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models.
b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects.
c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards.
d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues.
e) System issues: Hardware-software trade-offs.
f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.