Rapid: Zero-shot Domain Adaptation for Code Search with Pre-trained Models

IF 6.2 2区计算机科学 Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING ACM Transactions on Software Engineering and Methodology Pub Date : 2024-01-18 DOI:10.1145/3641542

Guodong Fan, Shizhan Chen, Cuiyun Gao, Jianmao Xiao, Tao Zhang, Zhiyong Feng

{"title":"Rapid: Zero-shot Domain Adaptation for Code Search with Pre-trained Models","authors":"Guodong Fan, Shizhan Chen, Cuiyun Gao, Jianmao Xiao, Tao Zhang, Zhiyong Feng","doi":"10.1145/3641542","DOIUrl":null,"url":null,"abstract":"Code search, which refers to the process of identifying the most relevant code snippets for a given natural language query, plays a crucial role in software maintenance. However, current approaches heavily rely on labeled data for training, which results in performance decreases when confronted with cross-domain scenarios including domain-specific or project-specific situations. This decline can be attributed to their limited ability to effectively capture the semantics associated with such scenarios. To tackle the aforementioned problem, we propose a zeRo-shot domAin adaPtion with pre-traIned moDels framework for code search named RAPID. The framework first generates synthetic data by pseudo labeling, then trains the CodeBERT with sampled synthetic data. To avoid the influence of noisy synthetic data and enhance the model performance, we propose a mixture sampling strategy to obtain hard negative samples during training. Specifically, the mixture sampling strategy considers both relevancy and diversity to select the data that are hard to be distinguished by the models. To validate the effectiveness of our approach in zero-shot settings, we conduct extensive experiments and find that RAPID outperforms the CoCoSoDa and UniXcoder model by an average of 15.7% and 10%, respectively, as measured by the MRR metric. When trained on full data, our approach results in an average improvement of 7.5% under the MRR metric using CodeBERT. We observe that as the model’s performance in zero-shot tasks improves, the impact of hard negatives diminishes. Our observation also indicates that fine-tuning CodeT5 for generating pseudo labels can enhance the performance of the code search model, and using only 100-shot samples can yield comparable results to the supervised baseline. Furthermore, we evaluate the effectiveness of RAPID in real-world code search tasks in three GitHub projects through both human and automated assessments. Our findings reveal RAPID exhibits superior performance, e.g., an average improvement of 18% under the MRR metric over the top-performing model.","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"37 1","pages":""},"PeriodicalIF":6.2000,"publicationDate":"2024-01-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Software Engineering and Methodology","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3641542","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}

引用次数: 0

Abstract

Code search, which refers to the process of identifying the most relevant code snippets for a given natural language query, plays a crucial role in software maintenance. However, current approaches heavily rely on labeled data for training, which results in performance decreases when confronted with cross-domain scenarios including domain-specific or project-specific situations. This decline can be attributed to their limited ability to effectively capture the semantics associated with such scenarios. To tackle the aforementioned problem, we propose a zeRo-shot domAin adaPtion with pre-traIned moDels framework for code search named RAPID. The framework first generates synthetic data by pseudo labeling, then trains the CodeBERT with sampled synthetic data. To avoid the influence of noisy synthetic data and enhance the model performance, we propose a mixture sampling strategy to obtain hard negative samples during training. Specifically, the mixture sampling strategy considers both relevancy and diversity to select the data that are hard to be distinguished by the models. To validate the effectiveness of our approach in zero-shot settings, we conduct extensive experiments and find that RAPID outperforms the CoCoSoDa and UniXcoder model by an average of 15.7% and 10%, respectively, as measured by the MRR metric. When trained on full data, our approach results in an average improvement of 7.5% under the MRR metric using CodeBERT. We observe that as the model’s performance in zero-shot tasks improves, the impact of hard negatives diminishes. Our observation also indicates that fine-tuning CodeT5 for generating pseudo labels can enhance the performance of the code search model, and using only 100-shot samples can yield comparable results to the supervised baseline. Furthermore, we evaluate the effectiveness of RAPID in real-world code search tasks in three GitHub projects through both human and automated assessments. Our findings reveal RAPID exhibits superior performance, e.g., an average improvement of 18% under the MRR metric over the top-performing model.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

快速：使用预训练模型进行代码搜索的零点领域自适应

代码搜索是指针对给定的自然语言查询识别最相关代码片段的过程，在软件维护中发挥着至关重要的作用。然而，当前的方法严重依赖于标注数据进行训练，这导致在面对跨领域场景（包括特定领域或特定项目情况）时性能下降。造成性能下降的原因是，这些方法有效捕捉与此类场景相关的语义的能力有限。为了解决上述问题，我们提出了一个用于代码搜索的带有预设模型的 zeRo-shot domAin adaPtion 框架，名为 RAPID。该框架首先通过伪标记生成合成数据，然后使用采样合成数据训练 CodeBERT。为了避免噪声合成数据的影响并提高模型性能，我们提出了一种混合采样策略，以便在训练过程中获得硬负样本。具体来说，混合采样策略同时考虑了相关性和多样性，以选择模型难以区分的数据。为了验证我们的方法在零样本设置中的有效性，我们进行了大量实验，发现根据 MRR 指标衡量，RAPID 的性能比 CoCoSoDa 和 UniXcoder 模型分别平均高出 15.7% 和 10%。当在完整数据上进行训练时，我们的方法在使用 CodeBERT 的 MRR 指标下平均提高了 7.5%。我们观察到，随着模型在零镜头任务中性能的提高，硬否定的影响也在减小。我们的观察还表明，对用于生成伪标签的 CodeT5 进行微调可以提高代码搜索模型的性能，而且仅使用 100 次样本就能获得与监督基线相当的结果。此外，我们还通过人工和自动评估，评估了 RAPID 在三个 GitHub 项目的实际代码搜索任务中的有效性。我们的研究结果表明，RAPID 表现出卓越的性能，例如，在 MRR 指标下，比表现最好的模型平均提高了 18%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

ACM Transactions on Software Engineering and Methodology 工程技术-计算机：软件工程

CiteScore

6.30

自引率

4.50%

发文量

164

审稿时长

>12 weeks

期刊介绍： Designing and building a large, complex software system is a tremendous challenge. ACM Transactions on Software Engineering and Methodology (TOSEM) publishes papers on all aspects of that challenge: specification, design, development and maintenance. It covers tools and methodologies, languages, data structures, and algorithms. TOSEM also reports on successful efforts, noting practical lessons that can be scaled and transferred to other projects, and often looks at applications of innovative technologies. The tone is scholarly but readable; the content is worthy of study; the presentation is effective.