Fine-tuning and prompt engineering for large language models-based code review automation

Information and Software Technology, Volume 175, Article 107523 (IF 3.8; Region 2, Computer Science; JCR Q2, Computer Science, Information Systems). Pub Date: 2024-07-11. DOI: 10.1016/j.infsof.2024.107523

Chanathip Pornprasit, Chakkrit Tantithamthavorn


Citations: 0

Abstract

Context:

The rapid evolution of Large Language Models (LLMs) has sparked significant interest in leveraging their capabilities for automating code review processes. Prior studies often focus on developing LLMs for code review automation, which requires expensive resources and is therefore infeasible for organizations with limited budgets. Thus, fine-tuning and prompt engineering are the two common approaches to leveraging LLMs for code review automation.

Objective:

We aim to investigate the performance of LLM-based code review automation in two contexts, i.e., when LLMs are leveraged by fine-tuning and by prompting. Fine-tuning involves training the model on a specific code review dataset, while prompting involves providing explicit instructions to guide the model’s generation process without requiring a specific code review dataset.

Methods:

We apply model fine-tuning and inference techniques (i.e., zero-shot learning, few-shot learning and persona) to LLM-based code review automation. In total, we investigate 12 variations of two LLM-based code review automation approaches (i.e., GPT-3.5 and Magicoder), and compare them with Guo et al.’s approach and three existing code review automation approaches (i.e., CodeReviewer, TufanoT5 and D-ACT).
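The three inference techniques named above differ only in how the prompt is assembled, which a short sketch may make concrete. The instruction text, the persona sentence, the section markers and the function names below are all hypothetical illustrations; the abstract does not give the authors' actual prompt templates.

```python
from typing import Dict, Optional, Sequence, Tuple

# Hypothetical wording; the paper's real templates are not given in the abstract.
INSTRUCTION = "Improve the submitted code below. Return only the revised code."
PERSONA = "You are an expert software developer reviewing a code change."


def build_prompt(
    code: str,
    examples: Sequence[Tuple[str, str]] = (),
    persona: Optional[str] = None,
) -> str:
    """Assemble a code review prompt.

    No examples      -> zero-shot prompt.
    Some examples    -> few-shot prompt (each is a (submitted, revised) pair).
    persona not None -> the persona sentence is placed first.
    """
    parts = []
    if persona:
        parts.append(persona)
    parts.append(INSTRUCTION)
    for submitted, revised in examples:
        parts.append(f"### Submitted code\n{submitted}\n### Revised code\n{revised}")
    # The query itself ends with an open "Revised code" slot for the model to fill.
    parts.append(f"### Submitted code\n{code}\n### Revised code")
    return "\n\n".join(parts)


def prompt_variants(code: str, examples: Sequence[Tuple[str, str]]) -> Dict[str, str]:
    """Return the four prompt-only combinations of shots and persona."""
    return {
        "zero-shot": build_prompt(code),
        "zero-shot+persona": build_prompt(code, persona=PERSONA),
        "few-shot": build_prompt(code, examples),
        "few-shot+persona": build_prompt(code, examples, persona=PERSONA),
    }


if __name__ == "__main__":
    demo = [("x = x + 1", "x += 1")]
    for name, prompt in prompt_variants("total = total + n", demo).items():
        print(f"--- {name} ---\n{prompt}\n")
```

These four prompt shapes cover only the prompting side. How they combine with fine-tuning and the two models to give the 12 variations studied is not spelled out in the abstract, so the sketch does not attempt to enumerate them.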

Results:

Fine-tuning GPT-3.5 with zero-shot learning helps it achieve 73.17%–74.23% higher EM than Guo et al.’s approach. In addition, when GPT-3.5 is not fine-tuned, GPT-3.5 with few-shot learning achieves 46.38%–659.09% higher EM than GPT-3.5 with zero-shot learning.
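For readers unfamiliar with the metric, the following is a minimal sketch of how an exact match (EM) score and an "X% higher" comparison could be computed. It assumes that EM is the fraction of generated revisions identical to the reference after whitespace normalization, and that the percentages above are relative improvements over the baseline. Both are assumptions: the abstract does not state the normalization or the formula, and the numbers in the example are made up.

```python
from typing import Sequence


def _normalize(code: str) -> str:
    # Assumption: collapse all whitespace runs before comparing.
    return " ".join(code.split())


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    hits = sum(_normalize(p) == _normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


def relative_gain(new_score: float, baseline_score: float) -> float:
    """Relative improvement of new_score over baseline_score, in percent."""
    if baseline_score == 0:
        raise ValueError("baseline score must be non-zero")
    return (new_score - baseline_score) / baseline_score * 100.0


if __name__ == "__main__":
    preds = ["x += 1", "return a+b"]
    refs = ["x  +=  1", "return a + b"]
    print(exact_match(preds, refs))   # 0.5: only the first pair matches
    print(relative_gain(30.0, 20.0))  # 50.0: illustrative scores, not from the paper
```

Under this reading, a very large relative figure such as the upper end of the few-shot range is what one would expect when the baseline EM is small, since the gain is divided by the baseline score.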

Conclusions:

Based on our results, we recommend that (1) LLMs for code review automation should be fine-tuned to achieve the highest performance; and (2) when data is not sufficient for model fine-tuning (e.g., a cold-start problem), few-shot learning without a persona should be used. Our findings contribute valuable insights into the practical recommendations and trade-offs associated with deploying LLMs for code review automation.




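Source journal

Information and Software Technology (Engineering and Technology; Computer Science: Software Engineering)

CiteScore: 9.10

Self-citation rate: 7.70%

Articles published: 164

Review time: 9.6 weeks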

About the journal: Information and Software Technology is the international archival journal focusing on research and experience that contributes to the improvement of software development practices. The journal's scope includes methods and techniques to better engineer software and manage its development. Articles submitted for review should have a clear component of software engineering or address ways to improve the engineering and management of software development. Areas covered by the journal include:
• Software management, quality and metrics
• Software processes
• Software architecture, modelling, specification, design and programming
• Functional and non-functional software requirements
• Software testing and verification & validation
• Empirical studies of all aspects of engineering and managing software development
Short Communications is a new section dedicated to short papers addressing new ideas, controversial opinions, "negative" results and much more. Read the Guide for Authors for more information. The journal encourages and welcomes submissions of systematic literature studies (reviews and maps) within the scope of the journal. Information and Software Technology is the premier outlet for systematic literature studies in software engineering.