[Engineering Paper] SCC: Automatic Classification of Code Snippets

Kamel Alreshedy, Dhanush Dharmaretnam, D. Germán, Venkatesh Srinivasan, T. Gulliver
Venue: 2018 IEEE 18th International Working Conference on Source Code Analysis and Manipulation (SCAM)
Published: 2018-09-01
DOI: 10.1109/SCAM.2018.00031 (https://doi.org/10.1109/SCAM.2018.00031)
Citations: 15

Abstract

Identifying the programming language of a complete source code file is a well-studied problem: Machine Learning (ML) and Natural Language Processing (NLP) algorithms have been shown to be effective at it. Identifying the language of a code snippet, which may be only a few lines long, remains challenging. Online forums such as Stack Overflow and code repositories such as GitHub contain a large number of code snippets. In this paper, we describe Source Code Classification (SCC), a classifier that identifies the programming language of code snippets written in 21 different programming languages. SCC employs a Multinomial Naive Bayes (MNB) classifier trained on Stack Overflow posts. It achieves an accuracy of 75%, compared with only 55.5% for Programming Languages Identification (PLI), a proprietary online classifier of snippets. The average precision, recall and F1 score of the proposed tool are 0.76, 0.75 and 0.75, respectively. In addition, SCC can distinguish between code snippets from a family of programming languages such as C, C++ and C#, and can also identify the programming language version, such as C# 3.0, C# 4.0 and C# 5.0.
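
To make the classification idea concrete, the following is a minimal sketch of a Multinomial Naive Bayes snippet classifier. It is not the SCC implementation: the tokenizer, the class name `TinyMultinomialNB`, the Laplace smoothing value `alpha=1.0`, and the six-snippet training set are all assumptions chosen for illustration, whereas the paper trains on Stack Overflow posts covering 21 languages.

```python
import math
import re
from collections import Counter, defaultdict

# Assumed tokenizer: identifiers/keywords, plus each punctuation character
# as its own token. Bare numeric literals are dropped.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|[^\sA-Za-z_0-9]")


def tokenize(snippet):
    return TOKEN_RE.findall(snippet)


class TinyMultinomialNB:
    """Hypothetical, illustrative Multinomial Naive Bayes over token counts."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (assumed value)

    def fit(self, snippets, labels):
        self.class_counts = Counter(labels)        # snippets per language
        self.token_counts = defaultdict(Counter)   # token counts per language
        self.vocab = set()
        for snippet, label in zip(snippets, labels):
            tokens = tokenize(snippet)
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, snippet):
        # Tokens never seen in training carry no evidence and are skipped.
        tokens = [t for t in tokenize(snippet) if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            # log P(label) + sum of log P(token | label), with smoothing
            score = math.log(n_docs / total_docs)
            for t in tokens:
                score += math.log((counts[t] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (made up for this sketch, not from the paper).
train = [
    ("def add(a, b):\n    return a + b", "python"),
    ("import os\nprint(os.getcwd())", "python"),
    ("public static void main(String[] args) { System.out.println(1); }", "java"),
    ("private int count; public int getCount() { return count; }", "java"),
    ('#include <stdio.h>\nint main(void) { printf("hi"); return 0; }', "c"),
    ("int *p = malloc(sizeof(int)); free(p);", "c"),
]
model = TinyMultinomialNB().fit([s for s, _ in train], [l for _, l in train])

print(model.predict("def square(x):\n    return x * x"))             # python
print(model.predict("public void setCount(int c) { count = c; }"))   # java
```

Scores are summed in log space so that products of many small probabilities do not underflow. With so little training data the predictions are driven by a handful of tokens such as `def`, `public` and `:`, which hints at why short snippets are harder to classify than whole files.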