José Antonio Hernández López;Boqi Chen;Mootez Saad;Tushar Sharma;Dániel Varró
{"title":"论大型语言模型中的数据集间代码重复和数据泄露","authors":"José Antonio Hernández López;Boqi Chen;Mootez Saad;Tushar Sharma;Dániel Varró","doi":"10.1109/TSE.2024.3504286","DOIUrl":null,"url":null,"abstract":"<italic>Motivation.</i>\n Large language models (\n<sc>LLM</small>\ns) have exhibited remarkable proficiency in diverse software engineering (\n<sc>SE</small>\n) tasks, such as code summarization, code translation, and code search. Handling such tasks typically involves acquiring foundational coding knowledge on large, general-purpose datasets during a pre-training phase, and subsequently refining on smaller, task-specific datasets as part of a fine-tuning phase. \n<italic>Problem statement.</i>\n Data leakage \n<italic>i.e.,</i>\n using information of the test set to perform the model training, is a well-known issue in training of machine learning models. A manifestation of this issue is the intersection of the training and testing splits. While \n<italic>intra-dataset</i>\n code duplication examines this intersection within a given dataset and has been addressed in prior research, \n<italic>inter-dataset code duplication</i>\n, which gauges the overlap between different datasets, remains largely unexplored. If this phenomenon exists, it could compromise the integrity of \n<sc>LLM</small>\n evaluations because of the inclusion of fine-tuning test samples that were already encountered during pre-training, resulting in inflated performance metrics. \n<italic>Contribution.</i>\n This paper explores the phenomenon of inter-dataset code duplication and its impact on evaluating \n<sc>LLM</small>\ns across diverse \n<sc>SE</small>\n tasks. \n<italic>Study design.</i>\n We conduct an empirical study using the \n<sc>CodeSearchNet</small>\n dataset (\n<sc>csn</small>\n), a widely adopted pre-training dataset, and five fine-tuning datasets used for various \n<sc>SE</small>\n tasks. We first identify the intersection between the pre-training and fine-tuning datasets using a deduplication process. Next, we pre-train two versions of \n<sc>LLM</small>\ns using a subset of \n<sc>csn</small>\n: one leaky \n<sc>LLM</small>\n, which includes the identified intersection in its pre-training set, and one non-leaky \n<sc>LLM</small>\n that excludes these samples. Finally, we fine-tune both models and compare their performances using fine-tuning test samples that are part of the intersection. \n<italic>Results.</i>\n Our findings reveal a potential threat to the evaluation of \n<sc>LLM</small>\ns across multiple \n<sc>SE</small>\n tasks, stemming from the inter-dataset code duplication phenomenon. We also demonstrate that this threat is accentuated by the chosen fine-tuning technique. Furthermore, we provide evidence that open-source models such as \n<sc>CodeBERT</small>\n, \n<sc>GraphCodeBERT</small>\n, and \n<sc>UnixCoder</small>\n could be affected by inter-dataset duplication. Based on our findings, we delve into prior research that may be susceptible to this threat. 
Additionally, we offer guidance to \n<sc>SE</small>\n researchers on strategies to prevent inter-dataset code duplication.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 1","pages":"192-205"},"PeriodicalIF":6.5000,"publicationDate":"2024-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"On Inter-Dataset Code Duplication and Data Leakage in Large Language Models\",\"authors\":\"José Antonio Hernández López;Boqi Chen;Mootez Saad;Tushar Sharma;Dániel Varró\",\"doi\":\"10.1109/TSE.2024.3504286\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<italic>Motivation.</i>\\n Large language models (\\n<sc>LLM</small>\\ns) have exhibited remarkable proficiency in diverse software engineering (\\n<sc>SE</small>\\n) tasks, such as code summarization, code translation, and code search. Handling such tasks typically involves acquiring foundational coding knowledge on large, general-purpose datasets during a pre-training phase, and subsequently refining on smaller, task-specific datasets as part of a fine-tuning phase. \\n<italic>Problem statement.</i>\\n Data leakage \\n<italic>i.e.,</i>\\n using information of the test set to perform the model training, is a well-known issue in training of machine learning models. A manifestation of this issue is the intersection of the training and testing splits. While \\n<italic>intra-dataset</i>\\n code duplication examines this intersection within a given dataset and has been addressed in prior research, \\n<italic>inter-dataset code duplication</i>\\n, which gauges the overlap between different datasets, remains largely unexplored. If this phenomenon exists, it could compromise the integrity of \\n<sc>LLM</small>\\n evaluations because of the inclusion of fine-tuning test samples that were already encountered during pre-training, resulting in inflated performance metrics. \\n<italic>Contribution.</i>\\n This paper explores the phenomenon of inter-dataset code duplication and its impact on evaluating \\n<sc>LLM</small>\\ns across diverse \\n<sc>SE</small>\\n tasks. \\n<italic>Study design.</i>\\n We conduct an empirical study using the \\n<sc>CodeSearchNet</small>\\n dataset (\\n<sc>csn</small>\\n), a widely adopted pre-training dataset, and five fine-tuning datasets used for various \\n<sc>SE</small>\\n tasks. We first identify the intersection between the pre-training and fine-tuning datasets using a deduplication process. Next, we pre-train two versions of \\n<sc>LLM</small>\\ns using a subset of \\n<sc>csn</small>\\n: one leaky \\n<sc>LLM</small>\\n, which includes the identified intersection in its pre-training set, and one non-leaky \\n<sc>LLM</small>\\n that excludes these samples. Finally, we fine-tune both models and compare their performances using fine-tuning test samples that are part of the intersection. \\n<italic>Results.</i>\\n Our findings reveal a potential threat to the evaluation of \\n<sc>LLM</small>\\ns across multiple \\n<sc>SE</small>\\n tasks, stemming from the inter-dataset code duplication phenomenon. We also demonstrate that this threat is accentuated by the chosen fine-tuning technique. Furthermore, we provide evidence that open-source models such as \\n<sc>CodeBERT</small>\\n, \\n<sc>GraphCodeBERT</small>\\n, and \\n<sc>UnixCoder</small>\\n could be affected by inter-dataset duplication. Based on our findings, we delve into prior research that may be susceptible to this threat. 
Additionally, we offer guidance to \\n<sc>SE</small>\\n researchers on strategies to prevent inter-dataset code duplication.\",\"PeriodicalId\":13324,\"journal\":{\"name\":\"IEEE Transactions on Software Engineering\",\"volume\":\"51 1\",\"pages\":\"192-205\"},\"PeriodicalIF\":6.5000,\"publicationDate\":\"2024-11-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Software Engineering\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10759822/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Software Engineering","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10759822/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
On Inter-Dataset Code Duplication and Data Leakage in Large Language Models
Motivation. Large language models (LLMs) have exhibited remarkable proficiency in diverse software engineering (SE) tasks, such as code summarization, code translation, and code search. Handling such tasks typically involves acquiring foundational coding knowledge from large, general-purpose datasets during a pre-training phase, and subsequently refining the model on smaller, task-specific datasets during a fine-tuning phase.

Problem statement. Data leakage, i.e., using information from the test set during model training, is a well-known issue in training machine learning models. A manifestation of this issue is the intersection of the training and testing splits. While intra-dataset code duplication examines this intersection within a given dataset and has been addressed in prior research, inter-dataset code duplication, which gauges the overlap between different datasets, remains largely unexplored. If this phenomenon exists, it could compromise the integrity of LLM evaluations: fine-tuning test samples already encountered during pre-training would inflate performance metrics.

Contribution. This paper explores the phenomenon of inter-dataset code duplication and its impact on evaluating LLMs across diverse SE tasks.
Study design. We conduct an empirical study using the CodeSearchNet dataset (CSN), a widely adopted pre-training dataset, and five fine-tuning datasets used for various SE tasks. We first identify the intersection between the pre-training and fine-tuning datasets using a deduplication process. Next, we pre-train two versions of LLMs using a subset of CSN: one leaky LLM, which includes the identified intersection in its pre-training set, and one non-leaky LLM that excludes these samples. Finally, we fine-tune both models and compare their performances on fine-tuning test samples that are part of the intersection.
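The abstract does not reproduce the deduplication pipeline itself; the sketch below only illustrates the general shape of such a step, assuming a simple normalized-token Jaccard criterion over code snippets. The function names, the 0.8 threshold, and the quadratic scan are illustrative assumptions, not details taken from the paper.

```python
import re

def token_set(code: str) -> frozenset:
    """Normalize a snippet to its set of lowercase identifier-like
    tokens, so formatting and comment noise are ignored."""
    return frozenset(re.findall(r"[A-Za-z_][A-Za-z_0-9]*", code.lower()))

def jaccard(a: frozenset, b: frozenset) -> float:
    """Token-set Jaccard similarity between two snippets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def leaked_indices(pretrain: list[str], finetune: list[str],
                   threshold: float = 0.8) -> set[int]:
    """Indices of pre-training samples that near-duplicate some
    fine-tuning sample. A real pipeline would use MinHash/LSH
    rather than this O(n*m) scan."""
    ft = [token_set(s) for s in finetune]
    return {i for i, s in enumerate(pretrain)
            if any(jaccard(token_set(s), t) >= threshold for t in ft)}

def make_corpora(pretrain: list[str], leaked: set[int]):
    """Build the two pre-training corpora: the leaky one keeps the
    intersection, the non-leaky one drops it before pre-training."""
    leaky = list(pretrain)
    non_leaky = [s for i, s in enumerate(pretrain) if i not in leaked]
    return leaky, non_leaky
```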
Results. Our findings reveal a potential threat to the evaluation of LLMs across multiple SE tasks, stemming from the inter-dataset code duplication phenomenon. We also demonstrate that this threat is accentuated by the chosen fine-tuning technique. Furthermore, we provide evidence that open-source models such as CodeBERT, GraphCodeBERT, and UniXcoder could be affected by inter-dataset duplication. Based on our findings, we delve into prior research that may be susceptible to this threat. Additionally, we offer guidance to SE researchers on strategies to prevent inter-dataset code duplication.
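As one hypothetical form such guidance could take (a minimal sketch, not the authors' prescribed tooling), a researcher can report metrics separately on the leaked and clean partitions of a fine-tuning test set, making any duplication-driven inflation visible:

```python
from statistics import mean

def partitioned_metric(scores: dict[int, float],
                       leaked_ids: set[int]) -> dict[str, float]:
    """Split per-sample scores (keyed by test-sample id) into samples
    that intersect the pre-training corpus and the rest, then report
    the mean metric for each partition."""
    leaked = [v for k, v in scores.items() if k in leaked_ids]
    clean = [v for k, v in scores.items() if k not in leaked_ids]
    return {
        "overall": mean(scores.values()),
        "leaked": mean(leaked) if leaked else float("nan"),
        "clean": mean(clean) if clean else float("nan"),
    }
```

A large gap between the leaked and clean partitions would signal that inter-dataset duplication is inflating the headline number.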
Journal Introduction:
IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include:
a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models.
b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects.
c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards.
d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues.
e) System issues: Hardware-software trade-offs.
f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.