Whodunit: Classifying Code as Human Authored or GPT-4 Generated - A case study on CodeChef problems

ArXiv Pub Date: 2024-03-06. DOI: 10.1145/3643991.3644926

Oseremen Joy Idialu, N. Mathews, Rungroj Maipradit, J. Atlee, Mei Nagappan



Abstract

Artificial intelligence (AI) assistants such as GitHub Copilot and ChatGPT, built on large language models like GPT-4, are revolutionizing how programming tasks are performed, raising questions about whether a given piece of code was authored by a generative AI model. Such questions are of particular interest to educators, who worry that these tools enable a new form of academic dishonesty in which students submit AI-generated code as their own work. Our research explores the viability of using code stylometry and machine learning to distinguish between GPT-4-generated and human-authored code. Our dataset comprises human-authored solutions from CodeChef and AI-authored solutions generated by GPT-4. Our classifier outperforms baselines, with an F1-score and an AUC-ROC score of 0.91. A variant of our classifier that excludes gameable features (e.g., empty lines, whitespace) still performs well, with an F1-score and an AUC-ROC score of 0.89. We also evaluated our classifier with respect to the difficulty of the programming problem and found almost no difference between easier and intermediate problems, with only slightly worse performance on harder problems. Our study shows that code stylometry is a promising approach for distinguishing between GPT-4-generated code and human-authored code.
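As a rough illustration of the "gameable" layout features and the F1-score mentioned in the abstract, the following is a minimal sketch in Python. The abstract does not specify the paper's feature set or classifier, so the function names (`layout_features`, `f1_score`) and the three features chosen here are assumptions for illustration only, not the authors' implementation.

```python
def layout_features(source: str) -> dict:
    """Compute a few layout-based stylometric features of a code snippet.

    Hypothetical feature set: the abstract names empty lines and
    whitespace only as examples of gameable features.
    """
    lines = source.splitlines()
    n_lines = max(len(lines), 1)   # avoid division by zero on empty input
    n_chars = max(len(source), 1)
    return {
        # Fraction of lines that are empty or contain only whitespace.
        "empty_line_ratio": sum(1 for ln in lines if not ln.strip()) / n_lines,
        # Fraction of all characters that are whitespace (spaces, tabs, newlines).
        "whitespace_ratio": sum(1 for ch in source if ch.isspace()) / n_chars,
        # Mean number of characters per line, excluding line terminators.
        "avg_line_length": sum(len(ln) for ln in lines) / n_lines,
    }


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    snippet = "a = 1\n\nb = 2\n"
    print(layout_features(snippet))
    # Labels are made up: 1 = GPT-4 generated, 0 = human authored.
    print(f1_score([1, 1, 0, 0], [1, 0, 0, 1]))
```

Features like these are easy for a student to alter by reformatting the code, which is presumably why the abstract also reports results for a classifier variant that excludes them.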


