Algorithm Identification in Programming Assignments

Pranshu Chourasia, Ganesh Ramakrishnan, V. Apte, Suraj Kumar
{"title":"编程作业中的算法辨识","authors":"Pranshu Chourasia, Ganesh Ramakrishnan, V. Apte, Suraj Kumar","doi":"10.1145/3524610.3527914","DOIUrl":null,"url":null,"abstract":"Current autograders of programming assignments are typically program output based; they fall short in many ways: e.g. they do not carry out subjective evaluations such as code quality, or whether the code has followed any instructor specified constraints; this is still done manually by teaching assistants. In this paper, we tackle a specific aspect of such evaluation: to verify whether a program implements a specific algorithm that the instructor specified. An algorithm, e.g. bubble sort, can be coded in myriad different ways, but a human can always understand the code and spot, say a bubble sort, vs. a selection sort. We develop and compare four approaches to do precisely this: given the source code of a program known to implement a certain functionality, identify the algorithm used, among a known set of algorithms. The approaches are based on code similarity, Support Vector Machine (SVM) with tree or graph kernels, and transformer neural architectures based only source code (CodeBERT), and the extension of this that includes code structure (GraphCodeBERT). Furthermore, we use a model for explainability (LIME) to generate insights into why certain programs get certain labels. Results based on our datasets of sorting, searching and shortest path codes, show that GraphCodeBERT, fine-tuned with scrambled source code, i.e., where identifiers are replaced consistently with arbitrary words, gives the best performance in algorithm identification, with accuracy of 96-99% depending on the functionality. Additionally, we add uncalled function source code elimination to our pre-processing pipeline of test programs, to improve the accuracy of classification of obfuscated source code.","PeriodicalId":426634,"journal":{"name":"2022 IEEE/ACM 30th International Conference on Program Comprehension (ICPC)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Algorithm Identification in Programming Assignments\",\"authors\":\"Pranshu Chourasia, Ganesh Ramakrishnan, V. Apte, Suraj Kumar\",\"doi\":\"10.1145/3524610.3527914\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Current autograders of programming assignments are typically program output based; they fall short in many ways: e.g. they do not carry out subjective evaluations such as code quality, or whether the code has followed any instructor specified constraints; this is still done manually by teaching assistants. In this paper, we tackle a specific aspect of such evaluation: to verify whether a program implements a specific algorithm that the instructor specified. An algorithm, e.g. bubble sort, can be coded in myriad different ways, but a human can always understand the code and spot, say a bubble sort, vs. a selection sort. We develop and compare four approaches to do precisely this: given the source code of a program known to implement a certain functionality, identify the algorithm used, among a known set of algorithms. The approaches are based on code similarity, Support Vector Machine (SVM) with tree or graph kernels, and transformer neural architectures based only source code (CodeBERT), and the extension of this that includes code structure (GraphCodeBERT). 
Furthermore, we use a model for explainability (LIME) to generate insights into why certain programs get certain labels. Results based on our datasets of sorting, searching and shortest path codes, show that GraphCodeBERT, fine-tuned with scrambled source code, i.e., where identifiers are replaced consistently with arbitrary words, gives the best performance in algorithm identification, with accuracy of 96-99% depending on the functionality. Additionally, we add uncalled function source code elimination to our pre-processing pipeline of test programs, to improve the accuracy of classification of obfuscated source code.\",\"PeriodicalId\":426634,\"journal\":{\"name\":\"2022 IEEE/ACM 30th International Conference on Program Comprehension (ICPC)\",\"volume\":\"16 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 IEEE/ACM 30th International Conference on Program Comprehension (ICPC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3524610.3527914\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE/ACM 30th International Conference on Program Comprehension (ICPC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3524610.3527914","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

Current autograders for programming assignments are typically based on program output; they fall short in many ways: for example, they do not carry out subjective evaluations such as assessing code quality or checking whether the code follows instructor-specified constraints, and such evaluation is still done manually by teaching assistants. In this paper, we tackle a specific aspect of such evaluation: verifying whether a program implements the specific algorithm that the instructor specified. An algorithm, e.g. bubble sort, can be coded in myriad different ways, but a human can always understand the code and spot, say, a bubble sort vs. a selection sort. We develop and compare four approaches to do precisely this: given the source code of a program known to implement a certain functionality, identify the algorithm used among a known set of algorithms. The approaches are based on code similarity, a Support Vector Machine (SVM) with tree or graph kernels, a transformer neural architecture based only on source code (CodeBERT), and an extension of the latter that also incorporates code structure (GraphCodeBERT). Furthermore, we use an explainability model (LIME) to generate insights into why certain programs receive certain labels. Results on our datasets of sorting, searching, and shortest-path codes show that GraphCodeBERT, fine-tuned on scrambled source code, i.e., code in which identifiers are consistently replaced with arbitrary words, gives the best performance in algorithm identification, with an accuracy of 96-99% depending on the functionality. Additionally, we add uncalled-function elimination to our pre-processing pipeline for test programs, to improve classification accuracy on obfuscated source code.
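The "scrambled source code" preprocessing mentioned in the abstract, consistently replacing identifiers with arbitrary words, is straightforward to prototype. The paper does not publish its implementation; the sketch below illustrates the idea for Python source using the standard tokenize module, and the replacement vocabulary and naming scheme are assumptions made for illustration.

```python
import builtins
import io
import keyword
import random
import tokenize

# Hypothetical replacement vocabulary; the paper's actual word list is not specified.
WORDS = ["apple", "river", "stone", "cloud", "maple", "ember", "quartz", "willow"]

def scramble_identifiers(source: str, seed: int = 0) -> str:
    """Consistently replace each identifier with an arbitrary word.

    The same identifier always maps to the same replacement, so program
    structure is preserved while naming cues such as 'bubble' or 'swap'
    are destroyed. Keywords and builtins are left alone so the result
    still parses.
    """
    rng = random.Random(seed)
    mapping = {}
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if (tok.type == tokenize.NAME
                and not keyword.iskeyword(tok.string)
                and not hasattr(builtins, tok.string)):
            if tok.string not in mapping:
                mapping[tok.string] = f"{rng.choice(WORDS)}_{len(mapping)}"
            out.append((tok.type, mapping[tok.string]))
        else:
            out.append((tok.type, tok.string))
    return tokenize.untokenize(out)
```

The two-tuple form of tokenize.untokenize recovers valid (if oddly spaced) source, which is all a classifier needs: the training label stays the same while the naming signal disappears.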
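The uncalled-function elimination step added to the preprocessing pipeline can be sketched similarly with the standard ast module. This is only an approximation of the idea described in the abstract: it drops top-level function definitions whose names never appear in a call position, and it does not handle aliased, indirect, or method calls.

```python
import ast

def drop_uncalled_functions(source: str) -> str:
    """Remove top-level function definitions that are never called.

    Obfuscated submissions sometimes pad programs with dead functions;
    deleting them keeps the classifier focused on code that actually runs.
    """
    tree = ast.parse(source)
    # Every name that appears in a direct call position, e.g. f(...).
    called = {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }
    tree.body = [
        node for node in tree.body
        if not (isinstance(node, ast.FunctionDef) and node.name not in called)
    ]
    return ast.unparse(tree)  # ast.unparse requires Python 3.9+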
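```

A dead function that recursively calls itself would survive this filter; a more careful version would compute reachability from the program's entry point instead.

For the classifier itself, the paper reports GraphCodeBERT fine-tuned on scrambled code as the best performer. Below is a minimal sketch that treats algorithm identification as sequence classification over the public microsoft/graphcodebert-base checkpoint. The label set is an illustrative guess at the sorting dataset, and the data-flow inputs that distinguish GraphCodeBERT from CodeBERT are omitted for brevity.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative label inventory; the paper's exact class set is not reproduced here.
LABELS = ["bubble_sort", "selection_sort", "insertion_sort", "merge_sort", "quick_sort"]

tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/graphcodebert-base", num_labels=len(LABELS)
)

def predict_algorithm(source: str) -> str:
    """Classify one (preferably scrambled) program into a known algorithm."""
    inputs = tokenizer(source, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]
```

Note that the classification head on top of the checkpoint is freshly initialized; predictions are only meaningful after fine-tuning on labeled assignment submissions.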
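Finally, the LIME analysis can be reproduced in spirit by treating source code as plain text: LIME perturbs the input, queries the classifier, and reports which tokens pushed the prediction toward its label. The sketch below reuses the checkpoint and illustrative labels from the previous snippet and feeds in a toy scrambled bubble sort.

```python
import torch
from lime.lime_text import LimeTextExplainer
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["bubble_sort", "selection_sort", "insertion_sort", "merge_sort", "quick_sort"]
tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/graphcodebert-base", num_labels=len(LABELS)
)

def predict_proba(snippets):
    """Batch classifier in the texts-to-class-probabilities form LIME expects."""
    inputs = tokenizer(list(snippets), truncation=True, max_length=512,
                       padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1).numpy()

# Toy input: a bubble sort with scrambled identifiers.
program = (
    "def maple_0(river_1):\n"
    "    for ember_2 in range(len(river_1)):\n"
    "        for stone_3 in range(len(river_1) - 1 - ember_2):\n"
    "            if river_1[stone_3] > river_1[stone_3 + 1]:\n"
    "                river_1[stone_3], river_1[stone_3 + 1] = "
    "river_1[stone_3 + 1], river_1[stone_3]\n"
)

explainer = LimeTextExplainer(class_names=LABELS)
exp = explainer.explain_instance(program, predict_proba, num_features=10, top_labels=1)
print(exp.as_list(label=exp.available_labels()[0]))  # token weights behind the prediction
```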