Zhou Yang;Zhipeng Zhao;Chenyu Wang;Jieke Shi;Dongsun Kim;DongGyun Han;David Lo
{"title":"找到了这个模型使用了我的代码评估代码模型中的成员泄漏风险","authors":"Zhou Yang;Zhipeng Zhao;Chenyu Wang;Jieke Shi;Dongsun Kim;DongGyun Han;David Lo","doi":"10.1109/TSE.2024.3482719","DOIUrl":null,"url":null,"abstract":"Leveraging large-scale datasets from open-source projects and advances in large language models, recent progress has led to sophisticated code models for key software engineering tasks, such as program repair and code completion. These models are trained on data from various sources, including public open-source projects like GitHub and private, confidential code from companies, raising significant privacy concerns. This paper investigates a crucial but unexplored question: \n<italic>What is the risk of membership information leakage in code models?</i>\n Membership leakage refers to the vulnerability where an attacker can infer whether a specific data point was part of the training dataset. We present \n<sc>Gotcha</small>\n, a novel membership inference attack method designed for code models, and evaluate its effectiveness on Java-based datasets. \n<sc>Gotcha</small>\n simultaneously considers three key factors: model input, model output, and ground truth. Our ablation study confirms that each factor significantly enhances attack performance. Our ablation study confirms that each factor significantly enhances attack performance. Our investigation reveals a troubling finding: \n<bold>membership leakage risk is significantly elevated</b>\n. While previous methods had accuracy close to random guessing, \n<sc>Gotcha</small>\n achieves high precision, with a true positive rate of 0.95 and a low false positive rate of 0.10. We also demonstrate that the attacker's knowledge of the victim model (e.g., model architecture and pre-training data) affects attack success. Additionally, modifying decoding strategies can help reduce membership leakage risks. This research highlights the urgent need to better understand the privacy vulnerabilities of code models and develop strong countermeasures against these threats.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"50 12","pages":"3290-3306"},"PeriodicalIF":6.5000,"publicationDate":"2024-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Gotcha! This Model Uses My Code! Evaluating Membership Leakage Risks in Code Models\",\"authors\":\"Zhou Yang;Zhipeng Zhao;Chenyu Wang;Jieke Shi;Dongsun Kim;DongGyun Han;David Lo\",\"doi\":\"10.1109/TSE.2024.3482719\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Leveraging large-scale datasets from open-source projects and advances in large language models, recent progress has led to sophisticated code models for key software engineering tasks, such as program repair and code completion. These models are trained on data from various sources, including public open-source projects like GitHub and private, confidential code from companies, raising significant privacy concerns. This paper investigates a crucial but unexplored question: \\n<italic>What is the risk of membership information leakage in code models?</i>\\n Membership leakage refers to the vulnerability where an attacker can infer whether a specific data point was part of the training dataset. We present \\n<sc>Gotcha</small>\\n, a novel membership inference attack method designed for code models, and evaluate its effectiveness on Java-based datasets. 
\\n<sc>Gotcha</small>\\n simultaneously considers three key factors: model input, model output, and ground truth. Our ablation study confirms that each factor significantly enhances attack performance. Our ablation study confirms that each factor significantly enhances attack performance. Our investigation reveals a troubling finding: \\n<bold>membership leakage risk is significantly elevated</b>\\n. While previous methods had accuracy close to random guessing, \\n<sc>Gotcha</small>\\n achieves high precision, with a true positive rate of 0.95 and a low false positive rate of 0.10. We also demonstrate that the attacker's knowledge of the victim model (e.g., model architecture and pre-training data) affects attack success. Additionally, modifying decoding strategies can help reduce membership leakage risks. This research highlights the urgent need to better understand the privacy vulnerabilities of code models and develop strong countermeasures against these threats.\",\"PeriodicalId\":13324,\"journal\":{\"name\":\"IEEE Transactions on Software Engineering\",\"volume\":\"50 12\",\"pages\":\"3290-3306\"},\"PeriodicalIF\":6.5000,\"publicationDate\":\"2024-10-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Software Engineering\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10735776/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Software Engineering","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10735776/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Gotcha! This Model Uses My Code! Evaluating Membership Leakage Risks in Code Models
Leveraging large-scale datasets from open-source projects and advances in large language models, recent progress has led to sophisticated code models for key software engineering tasks, such as program repair and code completion. These models are trained on data from various sources, including public open-source projects like GitHub and private, confidential code from companies, raising significant privacy concerns. This paper investigates a crucial but unexplored question: what is the risk of membership information leakage in code models? Membership leakage refers to the vulnerability where an attacker can infer whether a specific data point was part of the training dataset. We present Gotcha, a novel membership inference attack method designed for code models, and evaluate its effectiveness on Java-based datasets. Gotcha simultaneously considers three key factors: model input, model output, and ground truth. Our ablation study confirms that each factor significantly enhances attack performance. Our investigation reveals a troubling finding: membership leakage risk is significantly elevated. While previous methods achieved accuracy close to random guessing, Gotcha is far more effective, with a true positive rate of 0.95 and a false positive rate of 0.10. We also demonstrate that the attacker's knowledge of the victim model (e.g., its architecture and pre-training data) affects attack success, and that modifying decoding strategies can help reduce membership leakage risk. This research highlights the urgent need to better understand the privacy vulnerabilities of code models and to develop strong countermeasures against these threats.
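The abstract names the three signals Gotcha combines (model input, model output, and ground truth) without spelling out how an attacker might use them. The sketch below is a minimal, illustrative membership inference attack built from those signals; the victim model name (Salesforce/codegen-350M-mono), the two features (output-to-ground-truth similarity and per-example loss), and the logistic-regression membership classifier are assumptions made for illustration, not the paper's exact design.

# Minimal sketch of a membership inference attack on a code completion model,
# using the three signals named in the abstract: model input, model output,
# and ground truth. Model name, features, and classifier are illustrative
# assumptions, not the method described in the paper.
import torch
from difflib import SequenceMatcher
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "Salesforce/codegen-350M-mono"  # hypothetical victim model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def attack_features(prefix: str, ground_truth: str) -> list[float]:
    """Features derived from model input, model output, and ground truth."""
    inputs = tokenizer(prefix, return_tensors="pt")
    with torch.no_grad():
        generated = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    output = tokenizer.decode(
        generated[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    # Signal 1: how closely the model's completion matches the ground truth;
    # memorized training samples tend to be reproduced more faithfully.
    similarity = SequenceMatcher(None, output, ground_truth).ratio()
    # Signal 2: how confidently the model predicts the ground-truth tokens
    # (per-example loss); training members tend to receive lower loss.
    full = tokenizer(prefix + ground_truth, return_tensors="pt")
    with torch.no_grad():
        loss = model(**full, labels=full.input_ids).loss.item()
    return [similarity, loss]

# The attacker fits a binary classifier on examples whose membership status
# is already known (e.g., shadow data), then queries it on suspected targets:
#   X = [attack_features(p, gt) for p, gt in labeled_examples]
#   clf = LogisticRegression().fit(X, y)   # y: 1 = member, 0 = non-member
#   clf.predict([attack_features(target_prefix, target_completion)])

The abstract's observation that decoding strategies affect leakage would, in this sketch, correspond to changing the generate() arguments (for example, sampling with a higher temperature instead of greedy decoding), which weakens the output-similarity signal available to the attacker.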
About the journal:
IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include:
a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models.
b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects.
c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards.
d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues.
e) System issues: Hardware-software trade-offs.
f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.