Can OpenSource beat ChatGPT? -- A Comparative Study of Large Language Models for Text-to-Code Generation

Luis Mayer, Christian Heumann, Matthias Aßenmacher

arXiv:2409.04164 (arXiv - CS - Software Engineering), 2024-09-06
{"title":"Can OpenSource beat ChatGPT? -- A Comparative Study of Large Language Models for Text-to-Code Generation","authors":"Luis Mayer, Christian Heumann, Matthias Aßenmacher","doi":"arxiv-2409.04164","DOIUrl":null,"url":null,"abstract":"In recent years, large language models (LLMs) have emerged as powerful tools\nwith potential applications in various fields, including software engineering.\nWithin the scope of this research, we evaluate five different state-of-the-art\nLLMs - Bard, BingChat, ChatGPT, Llama2, and Code Llama - concerning their\ncapabilities for text-to-code generation. In an empirical study, we feed\nprompts with textual descriptions of coding problems sourced from the\nprogramming website LeetCode to the models with the task of creating solutions\nin Python. Subsequently, the quality of the generated outputs is assessed using\nthe testing functionalities of LeetCode. The results indicate large differences\nin performance between the investigated models. ChatGPT can handle these\ntypical programming challenges by far the most effectively, surpassing even\ncode-specialized models like Code Llama. To gain further insights, we measure\nthe runtime as well as the memory usage of the generated outputs and compared\nthem to the other code submissions on Leetcode. A detailed error analysis,\nencompassing a comparison of the differences concerning correct indentation and\nform of the generated code as well as an assignment of the incorrectly solved\ntasks to certain error categories allows us to obtain a more nuanced picture of\nthe results and potential for improvement. The results also show a clear\npattern of increasingly incorrect produced code when the models are facing a\nlot of context in the form of longer prompts.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Software Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.04164","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
In recent years, large language models (LLMs) have emerged as powerful tools with potential applications in various fields, including software engineering. Within the scope of this research, we evaluate five state-of-the-art LLMs - Bard, BingChat, ChatGPT, Llama2, and Code Llama - with respect to their capabilities for text-to-code generation. In an empirical study, we feed prompts containing textual descriptions of coding problems sourced from the programming website LeetCode to the models, with the task of creating solutions in Python. We then assess the quality of the generated outputs using LeetCode's testing functionality. The results indicate large differences in performance between the investigated models: ChatGPT handles these typical programming challenges by far the most effectively, surpassing even code-specialized models such as Code Llama. To gain further insights, we measure the runtime as well as the memory usage of the generated outputs and compare them to those of the other code submissions on LeetCode. A detailed error analysis, encompassing a comparison of the generated code with respect to correct indentation and form, as well as an assignment of the incorrectly solved tasks to error categories, allows us to obtain a more nuanced picture of the results and of the potential for improvement. The results also show a clear pattern: the models produce increasingly incorrect code when facing large amounts of context in the form of longer prompts.
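
To make the evaluation loop described above concrete, the following is a minimal sketch in Python. The paper does not publish a harness, so everything here is hypothetical: `query_model` stands in for whichever LLM API is under test, `func_name` and the test-case format are illustrative, and runtime and peak memory are approximated locally with the standard `time` and `tracemalloc` modules, whereas the study itself relies on LeetCode's own submission statistics.

```python
import time
import tracemalloc
from typing import Callable

# Hypothetical prompt wrapper; the study feeds LeetCode problem
# descriptions to the model and asks for a Python solution.
PROMPT_TEMPLATE = (
    "Solve the following LeetCode problem in Python. "
    "Return only the code.\n\n{description}"
)


def evaluate_solution(code: str, tests: list[tuple[tuple, object]],
                      func_name: str) -> dict:
    """Run generated code against (args, expected) test cases,
    recording correctness, runtime, and peak memory usage."""
    namespace: dict = {}
    try:
        exec(code, namespace)          # define the candidate function
        func = namespace[func_name]
    except Exception as err:
        return {"status": "compile_error", "error": str(err)}

    tracemalloc.start()
    start = time.perf_counter()
    try:
        passed = all(func(*args) == expected for args, expected in tests)
    except Exception as err:
        return {"status": "runtime_error", "error": str(err)}
    finally:
        runtime = time.perf_counter() - start
        _, peak_mem = tracemalloc.get_traced_memory()
        tracemalloc.stop()

    return {"status": "accepted" if passed else "wrong_answer",
            "runtime_s": runtime, "peak_mem_bytes": peak_mem}


def run_benchmark(problems: list[dict],
                  query_model: Callable[[str], str]) -> list[dict]:
    """Feed each problem description to the model and score the output."""
    results = []
    for problem in problems:
        prompt = PROMPT_TEMPLATE.format(description=problem["description"])
        generated_code = query_model(prompt)
        results.append(evaluate_solution(generated_code,
                                         problem["tests"],
                                         problem["func_name"]))
    return results
```

The `status` labels mirror the kind of error categories the abstract mentions (e.g. compile errors versus wrong answers versus runtime errors), which makes it straightforward to assign incorrectly solved tasks to buckets during the error analysis.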