SCCD-GAN: An Enhanced Semantic Code Clone Detection Model Using GAN

Kun Xu, Yan Liu
Published in: 2021 IEEE 4th International Conference on Electronics and Communication Engineering (ICECE)
Publication date: 2021-12-17
DOI: https://doi.org/10.1109/ICECE54449.2021.9674552
Citations: 1

Abstract

A code clone is a pair of code fragments in a code base that are semantically similar, whether their syntax is similar or different. Excessive code clones in a software system have a negative impact on development and maintenance. In recent years, as deep learning has become a hot research area of machine learning, researchers have tried to apply deep learning techniques to code clone detection. They have proposed a series of detection techniques that use unstructured information (code in the form of sequential tokens) and structured information (code in the form of abstract syntax trees and control-flow graphs) to detect semantically similar but syntactically different code clones, which are the most difficult clone type to detect. However, although these methods have achieved an important improvement in the precision of semantic code clone detection, the corresponding false positive rate (FPR) is also very high, which prevents these methods from being applied effectively to real-world code bases. This paper proposes SCCD-GAN, an enhanced semantic code clone detection model that is based on a graph representation of programs and uses a Graph Attention Network to measure the similarity of code pairs, achieving a lower detection FPR than existing methods. We build the graph representation of the code by adding control-flow and data-flow information to the original abstract syntax tree, and we equip our model with an attention mechanism that focuses on the code parts and features that contribute most to the final detection precision. We implemented and evaluated the proposed method on benchmark datasets in the field of code clone detection, BigCloneBench2 and Google Code Jam. SCCD-GAN performed better than the existing state-of-the-art methods in terms of precision and false positive rate.
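The abstract describes the graph representation only at a high level: an abstract syntax tree extended with control-flow and data-flow information. The sketch below illustrates that general idea and is not the authors' implementation. It uses Python's standard `ast` module purely for convenience; the function name `build_code_graph`, the three edge labels, and the simplified rules for control flow (consecutive statements in a block) and data flow (most recent textual write to a later read) are all assumptions made for this example.

```python
import ast


def build_code_graph(source):
    """Return (nodes, edges) for a Python snippet (illustrative only).

    nodes: list of AST node type names; the list index is the node id.
    edges: set of (src_id, dst_id, kind), where kind is "ast",
           "next_stmt" (simplified control flow) or "data_flow".
    """
    tree = ast.parse(source)
    nodes = []
    edges = set()
    ids = {}  # id(AST object) -> node id; looked up for statements and names

    def visit(node, parent_id):
        node_id = len(nodes)
        nodes.append(type(node).__name__)
        ids[id(node)] = node_id
        if parent_id is not None:
            edges.add((parent_id, node_id, "ast"))
        for child in ast.iter_child_nodes(node):
            # Load/Store markers carry no structure of their own; skip them.
            if not isinstance(child, ast.expr_context):
                visit(child, node_id)

    visit(tree, None)

    # Assumed control-flow rule: link consecutive statements of one block.
    for node in ast.walk(tree):
        for field in ("body", "orelse", "finalbody"):
            block = getattr(node, field, None)
            if isinstance(block, list):
                for first, second in zip(block, block[1:]):
                    edges.add((ids[id(first)], ids[id(second)], "next_stmt"))

    # Assumed data-flow rule: link the most recent textual write of a name
    # to each later read. Reads on a line are handled before writes on that
    # line, so "x = x + 1" reads the earlier x. Branches, loops, scopes and
    # augmented assignments are not modelled.
    names = [n for n in ast.walk(tree) if isinstance(n, ast.Name)]
    names.sort(key=lambda n: (n.lineno, isinstance(n.ctx, ast.Store), n.col_offset))
    last_write = {}
    for name in names:
        if isinstance(name.ctx, ast.Store):
            last_write[name.id] = ids[id(name)]
        elif isinstance(name.ctx, ast.Load) and name.id in last_write:
            edges.add((last_write[name.id], ids[id(name)], "data_flow"))

    return nodes, edges


if __name__ == "__main__":
    snippet = "def f(a):\n    b = a + 1\n    c = b * 2\n    return c\n"
    nodes, edges = build_code_graph(snippet)
    for src, dst, kind in sorted(edges):
        if kind != "ast":
            print(f"{nodes[src]}#{src} -[{kind}]-> {nodes[dst]}#{dst}")
```

A graph of this shape, with typed edges over AST nodes, is the kind of input a graph neural network such as a Graph Attention Network could consume. How SCCD-GAN actually encodes nodes and edge types is not stated in the abstract.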