Hossein Hajipour, Lea Schönherr, Thorsten Holz, Mario Fritz
{"title":"HexaCoder: Secure Code Generation via Oracle-Guided Synthetic Training Data","authors":"Hossein Hajipour, Lea Schönherr, Thorsten Holz, Mario Fritz","doi":"arxiv-2409.06446","DOIUrl":null,"url":null,"abstract":"Large language models (LLMs) have shown great potential for automatic code\ngeneration and form the basis for various tools such as GitHub Copilot.\nHowever, recent studies highlight that many LLM-generated code contains serious\nsecurity vulnerabilities. While previous work tries to address this by training\nmodels that generate secure code, these attempts remain constrained by limited\naccess to training data and labor-intensive data preparation. In this paper, we introduce HexaCoder, a novel approach to enhance the\nability of LLMs to generate secure codes by automatically synthesizing secure\ncodes, which reduces the effort of finding suitable training data. HexaCoder\ncomprises two key components: an oracle-guided data synthesis pipeline and a\ntwo-step process for secure code generation. The data synthesis pipeline\ngenerates pairs of vulnerable and fixed codes for specific Common Weakness\nEnumeration (CWE) types by utilizing a state-of-the-art LLM for repairing\nvulnerable code. A security oracle identifies vulnerabilities, and a\nstate-of-the-art LLM repairs them by extending and/or editing the codes,\ncreating data pairs for fine-tuning using the Low-Rank Adaptation (LoRA)\nmethod. Each example of our fine-tuning dataset includes the necessary\nsecurity-related libraries and code that form the basis of our novel two-step\ngeneration approach. This allows the model to integrate security-relevant\nlibraries before generating the main code, significantly reducing the number of\ngenerated vulnerable codes by up to 85% compared to the baseline methods. We\nperform extensive evaluations on three different benchmarks for four LLMs,\ndemonstrating that HexaCoder not only improves the security of the generated\ncode but also maintains a high level of functional correctness.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"63 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Software Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.06446","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Large language models (LLMs) have shown great potential for automatic code
generation and form the basis for various tools such as GitHub Copilot.
However, recent studies highlight that many LLM-generated code contains serious
security vulnerabilities. While previous work tries to address this by training
models that generate secure code, these attempts remain constrained by limited
access to training data and labor-intensive data preparation. In this paper, we introduce HexaCoder, a novel approach to enhance the
ability of LLMs to generate secure codes by automatically synthesizing secure
codes, which reduces the effort of finding suitable training data. HexaCoder
comprises two key components: an oracle-guided data synthesis pipeline and a
two-step process for secure code generation. The data synthesis pipeline
generates pairs of vulnerable and fixed codes for specific Common Weakness
Enumeration (CWE) types by utilizing a state-of-the-art LLM for repairing
vulnerable code. A security oracle identifies vulnerabilities, and a
state-of-the-art LLM repairs them by extending and/or editing the codes,
creating data pairs for fine-tuning using the Low-Rank Adaptation (LoRA)
method. Each example of our fine-tuning dataset includes the necessary
security-related libraries and code that form the basis of our novel two-step
generation approach. This allows the model to integrate security-relevant
libraries before generating the main code, significantly reducing the number of
generated vulnerable codes by up to 85% compared to the baseline methods. We
perform extensive evaluations on three different benchmarks for four LLMs,
demonstrating that HexaCoder not only improves the security of the generated
code but also maintains a high level of functional correctness.