José Antonio Hernández López;Boqi Chen;Mootez Saad;Tushar Sharma;Dániel Varró
{"title":"论大型语言模型中的数据集间代码重复和数据泄露","authors":"José Antonio Hernández López;Boqi Chen;Mootez Saad;Tushar Sharma;Dániel Varró","doi":"10.1109/TSE.2024.3504286","DOIUrl":null,"url":null,"abstract":"<italic>Motivation.</i>\n Large language models (\n<sc>LLM</small>\ns) have exhibited remarkable proficiency in diverse software engineering (\n<sc>SE</small>\n) tasks, such as code summarization, code translation, and code search. Handling such tasks typically involves acquiring foundational coding knowledge on large, general-purpose datasets during a pre-training phase, and subsequently refining on smaller, task-specific datasets as part of a fine-tuning phase. \n<italic>Problem statement.</i>\n Data leakage \n<italic>i.e.,</i>\n using information of the test set to perform the model training, is a well-known issue in training of machine learning models. A manifestation of this issue is the intersection of the training and testing splits. While \n<italic>intra-dataset</i>\n code duplication examines this intersection within a given dataset and has been addressed in prior research, \n<italic>inter-dataset code duplication</i>\n, which gauges the overlap between different datasets, remains largely unexplored. If this phenomenon exists, it could compromise the integrity of \n<sc>LLM</small>\n evaluations because of the inclusion of fine-tuning test samples that were already encountered during pre-training, resulting in inflated performance metrics. \n<italic>Contribution.</i>\n This paper explores the phenomenon of inter-dataset code duplication and its impact on evaluating \n<sc>LLM</small>\ns across diverse \n<sc>SE</small>\n tasks. \n<italic>Study design.</i>\n We conduct an empirical study using the \n<sc>CodeSearchNet</small>\n dataset (\n<sc>csn</small>\n), a widely adopted pre-training dataset, and five fine-tuning datasets used for various \n<sc>SE</small>\n tasks. We first identify the intersection between the pre-training and fine-tuning datasets using a deduplication process. Next, we pre-train two versions of \n<sc>LLM</small>\ns using a subset of \n<sc>csn</small>\n: one leaky \n<sc>LLM</small>\n, which includes the identified intersection in its pre-training set, and one non-leaky \n<sc>LLM</small>\n that excludes these samples. Finally, we fine-tune both models and compare their performances using fine-tuning test samples that are part of the intersection. \n<italic>Results.</i>\n Our findings reveal a potential threat to the evaluation of \n<sc>LLM</small>\ns across multiple \n<sc>SE</small>\n tasks, stemming from the inter-dataset code duplication phenomenon. We also demonstrate that this threat is accentuated by the chosen fine-tuning technique. Furthermore, we provide evidence that open-source models such as \n<sc>CodeBERT</small>\n, \n<sc>GraphCodeBERT</small>\n, and \n<sc>UnixCoder</small>\n could be affected by inter-dataset duplication. Based on our findings, we delve into prior research that may be susceptible to this threat. 
Additionally, we offer guidance to \n<sc>SE</small>\n researchers on strategies to prevent inter-dataset code duplication.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 1","pages":"192-205"},"PeriodicalIF":6.5000,"publicationDate":"2024-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"On Inter-Dataset Code Duplication and Data Leakage in Large Language Models\",\"authors\":\"José Antonio Hernández López;Boqi Chen;Mootez Saad;Tushar Sharma;Dániel Varró\",\"doi\":\"10.1109/TSE.2024.3504286\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<italic>Motivation.</i>\\n Large language models (\\n<sc>LLM</small>\\ns) have exhibited remarkable proficiency in diverse software engineering (\\n<sc>SE</small>\\n) tasks, such as code summarization, code translation, and code search. Handling such tasks typically involves acquiring foundational coding knowledge on large, general-purpose datasets during a pre-training phase, and subsequently refining on smaller, task-specific datasets as part of a fine-tuning phase. \\n<italic>Problem statement.</i>\\n Data leakage \\n<italic>i.e.,</i>\\n using information of the test set to perform the model training, is a well-known issue in training of machine learning models. A manifestation of this issue is the intersection of the training and testing splits. While \\n<italic>intra-dataset</i>\\n code duplication examines this intersection within a given dataset and has been addressed in prior research, \\n<italic>inter-dataset code duplication</i>\\n, which gauges the overlap between different datasets, remains largely unexplored. If this phenomenon exists, it could compromise the integrity of \\n<sc>LLM</small>\\n evaluations because of the inclusion of fine-tuning test samples that were already encountered during pre-training, resulting in inflated performance metrics. \\n<italic>Contribution.</i>\\n This paper explores the phenomenon of inter-dataset code duplication and its impact on evaluating \\n<sc>LLM</small>\\ns across diverse \\n<sc>SE</small>\\n tasks. \\n<italic>Study design.</i>\\n We conduct an empirical study using the \\n<sc>CodeSearchNet</small>\\n dataset (\\n<sc>csn</small>\\n), a widely adopted pre-training dataset, and five fine-tuning datasets used for various \\n<sc>SE</small>\\n tasks. We first identify the intersection between the pre-training and fine-tuning datasets using a deduplication process. Next, we pre-train two versions of \\n<sc>LLM</small>\\ns using a subset of \\n<sc>csn</small>\\n: one leaky \\n<sc>LLM</small>\\n, which includes the identified intersection in its pre-training set, and one non-leaky \\n<sc>LLM</small>\\n that excludes these samples. Finally, we fine-tune both models and compare their performances using fine-tuning test samples that are part of the intersection. \\n<italic>Results.</i>\\n Our findings reveal a potential threat to the evaluation of \\n<sc>LLM</small>\\ns across multiple \\n<sc>SE</small>\\n tasks, stemming from the inter-dataset code duplication phenomenon. We also demonstrate that this threat is accentuated by the chosen fine-tuning technique. Furthermore, we provide evidence that open-source models such as \\n<sc>CodeBERT</small>\\n, \\n<sc>GraphCodeBERT</small>\\n, and \\n<sc>UnixCoder</small>\\n could be affected by inter-dataset duplication. Based on our findings, we delve into prior research that may be susceptible to this threat. 
Additionally, we offer guidance to \\n<sc>SE</small>\\n researchers on strategies to prevent inter-dataset code duplication.\",\"PeriodicalId\":13324,\"journal\":{\"name\":\"IEEE Transactions on Software Engineering\",\"volume\":\"51 1\",\"pages\":\"192-205\"},\"PeriodicalIF\":6.5000,\"publicationDate\":\"2024-11-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Software Engineering\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10759822/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Software Engineering","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10759822/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
On Inter-Dataset Code Duplication and Data Leakage in Large Language Models
Motivation. Large language models (LLMs) have exhibited remarkable proficiency in diverse software engineering (SE) tasks, such as code summarization, code translation, and code search. Handling such tasks typically involves acquiring foundational coding knowledge from large, general-purpose datasets during a pre-training phase, and subsequently refining the model on smaller, task-specific datasets during a fine-tuning phase.

Problem statement. Data leakage, i.e., using information from the test set during model training, is a well-known issue in training machine learning models. A manifestation of this issue is the intersection of the training and testing splits. While intra-dataset code duplication examines this intersection within a given dataset and has been addressed in prior research, inter-dataset code duplication, which gauges the overlap between different datasets, remains largely unexplored. If this phenomenon exists, it could compromise the integrity of LLM evaluations: fine-tuning test samples already encountered during pre-training would inflate performance metrics.

Contribution. This paper explores the phenomenon of inter-dataset code duplication and its impact on evaluating LLMs across diverse SE tasks.
Study design. We conduct an empirical study using the CodeSearchNet dataset (CSN), a widely adopted pre-training dataset, and five fine-tuning datasets used for various SE tasks. We first identify the intersection between the pre-training and fine-tuning datasets using a deduplication process. Next, we pre-train two versions of LLMs using a subset of CSN: one leaky LLM, which includes the identified intersection in its pre-training set, and one non-leaky LLM that excludes these samples. Finally, we fine-tune both models and compare their performances on fine-tuning test samples that are part of the intersection.
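The abstract does not reproduce the deduplication pipeline itself; the sketch below only illustrates the general shape of such a step, assuming a simple normalized-token Jaccard criterion over code snippets. The function names, the 0.8 threshold, and the quadratic scan are illustrative assumptions, not details taken from the paper.

```python
import re

def token_set(code: str) -> frozenset:
    """Normalize a snippet to its set of lowercase identifier-like
    tokens, so formatting and comment noise are ignored."""
    return frozenset(re.findall(r"[A-Za-z_][A-Za-z_0-9]*", code.lower()))

def jaccard(a: frozenset, b: frozenset) -> float:
    """Token-set Jaccard similarity between two snippets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def leaked_indices(pretrain: list[str], finetune: list[str],
                   threshold: float = 0.8) -> set[int]:
    """Indices of pre-training samples that near-duplicate some
    fine-tuning sample. A real pipeline would use MinHash/LSH
    rather than this O(n*m) scan."""
    ft = [token_set(s) for s in finetune]
    return {i for i, s in enumerate(pretrain)
            if any(jaccard(token_set(s), t) >= threshold for t in ft)}

def make_corpora(pretrain: list[str], leaked: set[int]):
    """Build the two pre-training corpora: the leaky one keeps the
    intersection, the non-leaky one drops it before pre-training."""
    leaky = list(pretrain)
    non_leaky = [s for i, s in enumerate(pretrain) if i not in leaked]
    return leaky, non_leaky
```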
Results. Our findings reveal a potential threat to the evaluation of LLMs across multiple SE tasks, stemming from the inter-dataset code duplication phenomenon. We also demonstrate that this threat is accentuated by the chosen fine-tuning technique. Furthermore, we provide evidence that open-source models such as CodeBERT, GraphCodeBERT, and UniXcoder could be affected by inter-dataset duplication. Based on our findings, we delve into prior research that may be susceptible to this threat. Additionally, we offer guidance to SE researchers on strategies to prevent inter-dataset code duplication.
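As one hypothetical form such guidance could take (a minimal sketch, not the authors' prescribed tooling), a researcher can report metrics separately on the leaked and clean partitions of a fine-tuning test set, making any duplication-driven inflation visible:

```python
from statistics import mean

def partitioned_metric(scores: dict[int, float],
                       leaked_ids: set[int]) -> dict[str, float]:
    """Split per-sample scores (keyed by test-sample id) into samples
    that intersect the pre-training corpus and the rest, then report
    the mean metric for each partition."""
    leaked = [v for k, v in scores.items() if k in leaked_ids]
    clean = [v for k, v in scores.items() if k not in leaked_ids]
    return {
        "overall": mean(scores.values()),
        "leaked": mean(leaked) if leaked else float("nan"),
        "clean": mean(clean) if clean else float("nan"),
    }
```

A large gap between the leaked and clean partitions would signal that inter-dataset duplication is inflating the headline number.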
Journal Introduction:
IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include:
a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models.
b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects.
c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards.
d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues.
e) System issues: Hardware-software trade-offs.
f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.