SCCD-GAN: An Enhanced Semantic Code Clone Detection Model Using GAN

2021 IEEE 4th International Conference on Electronics and Communication Engineering (ICECE). Pub Date: 2021-12-17. DOI: 10.1109/ICECE54449.2021.9674552

Kun Xu, Yan Liu


Citations: 1

Abstract




A code clone is a pair of code fragments in a code base that are semantically similar, whether their syntax is similar or different. Excessive code clones in a software system negatively affect its development and maintenance. In recent years, as deep learning has become an active research area within machine learning, researchers have applied deep learning techniques to code clone detection. They have proposed a series of detection techniques that use unstructured information (code as a sequence of tokens) and structured information (code as abstract syntax trees and control-flow graphs) to detect semantically similar but syntactically different code clones, the most difficult clone type to detect. Although these methods have considerably improved the precision of semantic code clone detection, their false positive rate (FPR) remains very high, which prevents them from being applied effectively to real-world code bases. This paper proposes SCCD-GAN, an enhanced semantic code clone detection model that is based on a graph representation of programs and uses a Graph Attention Network to measure the similarity of code pairs, achieving a lower detection FPR than existing methods. We build the graph representation of the code by adding control-flow and data-flow information to the original abstract syntax tree, and we equip the model with an attention mechanism that focuses on the code parts and features that contribute most to the final detection precision. We implemented and evaluated the proposed method on two benchmark datasets in the field of code clone detection, BigCloneBench2 and Google Code Jam. SCCD-GAN performed better than existing state-of-the-art methods in terms of precision and false positive rate.
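The abstract reports both precision and false positive rate, and the two can diverge. The following minimal Python sketch computes both from a confusion matrix to show how a detector may score high precision while still having a high FPR when non-clone pairs are rare in the evaluation set. The counts are invented for illustration and are not results from the paper; the names `ConfusionCounts`, `precision`, and `false_positive_rate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Outcome counts for a binary clone / non-clone classifier."""
    tp: int  # clone pairs correctly reported as clones
    fp: int  # non-clone pairs wrongly reported as clones
    tn: int  # non-clone pairs correctly rejected
    fn: int  # clone pairs that were missed


def precision(c: ConfusionCounts) -> float:
    """Fraction of reported clones that are real clones: TP / (TP + FP)."""
    reported = c.tp + c.fp
    if reported == 0:
        raise ValueError("precision is undefined when nothing is reported")
    return c.tp / reported


def false_positive_rate(c: ConfusionCounts) -> float:
    """Fraction of real non-clones that are reported as clones: FP / (FP + TN)."""
    negatives = c.fp + c.tn
    if negatives == 0:
        raise ValueError("FPR is undefined when there are no negative pairs")
    return c.fp / negatives


# Invented counts (assumption): 1000 true clone pairs, 100 non-clone pairs.
counts = ConfusionCounts(tp=900, fp=50, tn=50, fn=100)

print(f"precision = {precision(counts):.3f}")          # 900 / 950
print(f"FPR       = {false_positive_rate(counts):.3f}")  # 50 / 100
```

With these made-up counts, about 95% of reported clones are correct, yet half of all non-clone pairs are flagged. This is why a lower FPR matters in addition to precision when a detector is run on a real code base, where most fragment pairs are not clones.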

