Rapid: Zero-shot Domain Adaptation for Code Search with Pre-trained Models

IF 6.6 · Zone 2 (Computer Science) · JCR Q1, Computer Science, Software Engineering · ACM Transactions on Software Engineering and Methodology · Pub Date: 2024-01-18 · DOI: 10.1145/3641542
Guodong Fan, Shizhan Chen, Cuiyun Gao, Jianmao Xiao, Tao Zhang, Zhiyong Feng
Citations: 0

Abstract

Code search, the task of identifying the most relevant code snippets for a given natural language query, plays a crucial role in software maintenance. However, current approaches rely heavily on labeled training data, so their performance degrades in cross-domain scenarios such as domain-specific or project-specific settings. This decline can be attributed to their limited ability to capture the semantics of such scenarios. To tackle this problem, we propose RAPID, a zeRo-shot domAin adaPtion with pre-traIned moDels framework for code search. The framework first generates synthetic data by pseudo labeling, then trains CodeBERT on sampled synthetic data. To limit the influence of noisy synthetic data and improve model performance, we propose a mixture sampling strategy for obtaining hard negative samples during training. Specifically, the strategy considers both relevancy and diversity to select samples that are hard for the model to distinguish. To validate our approach in zero-shot settings, we conduct extensive experiments and find that RAPID outperforms the CoCoSoDa and UniXcoder models by an average of 15.7% and 10%, respectively, as measured by MRR. When trained on full data, our approach yields an average MRR improvement of 7.5% using CodeBERT. We observe that as the model's zero-shot performance improves, the impact of hard negatives diminishes. We also observe that fine-tuning CodeT5 to generate pseudo labels can enhance the performance of the code search model, and that using only 100-shot samples can yield results comparable to the supervised baseline. Furthermore, we evaluate RAPID on real-world code search tasks in three GitHub projects through both human and automated assessments. Our findings show that RAPID exhibits superior performance, e.g., an average MRR improvement of 18% over the top-performing model.
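
The results above are all reported in MRR (mean reciprocal rank). As a reminder of what that metric measures, here is a minimal sketch of how MRR is commonly computed for a retrieval task. It is not the authors' evaluation code; the function names and the toy similarity scores are illustrative assumptions.

```python
from typing import Sequence


def rank_of_correct(scores: Sequence[float], correct_index: int) -> int:
    """Return the 1-based rank of the correct snippet for one query.

    scores[j] is the (hypothetical) similarity between the query and
    candidate snippet j; higher means more relevant.
    """
    target = scores[correct_index]
    # Every candidate scoring strictly higher pushes the correct one down.
    return 1 + sum(1 for s in scores if s > target)


def mean_reciprocal_rank(
    score_rows: Sequence[Sequence[float]],
    correct_indices: Sequence[int],
) -> float:
    """Average 1/rank of the correct snippet over all queries."""
    if len(score_rows) != len(correct_indices) or not score_rows:
        raise ValueError("need one correct index per non-empty score row")
    total = 0.0
    for scores, correct in zip(score_rows, correct_indices):
        total += 1.0 / rank_of_correct(scores, correct)
    return total / len(score_rows)


# Toy data (assumed): three queries, three candidate snippets each.
toy_scores = [
    [0.9, 0.1, 0.3],  # correct snippet 0 is ranked 1st
    [0.2, 0.8, 0.5],  # correct snippet 2 is ranked 2nd
    [0.4, 0.6, 0.1],  # correct snippet 0 is ranked 2nd
]
toy_correct = [0, 2, 0]

print(round(mean_reciprocal_rank(toy_scores, toy_correct), 4))  # 0.6667
```

An MRR of 1.0 means the correct snippet is always ranked first, so a higher MRR indicates that relevant code appears nearer the top of the result list.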

Source journal
ACM Transactions on Software Engineering and Methodology (Engineering & Technology: Computer Science, Software Engineering)
CiteScore: 6.30
Self-citation rate: 4.50%
Articles published: 164
Review time: >12 weeks
Journal description: Designing and building a large, complex software system is a tremendous challenge. ACM Transactions on Software Engineering and Methodology (TOSEM) publishes papers on all aspects of that challenge: specification, design, development and maintenance. It covers tools and methodologies, languages, data structures, and algorithms. TOSEM also reports on successful efforts, noting practical lessons that can be scaled and transferred to other projects, and often looks at applications of innovative technologies. The tone is scholarly but readable; the content is worthy of study; the presentation is effective.