Using a Nearest-Neighbour, BERT-Based Approach for Scalable Clone Detection

2022 IEEE International Conference on Software Maintenance and Evolution (ICSME). Pub Date: 2022-10-01. DOI: 10.1109/ICSME55016.2022.00080

Muslim Chochlov, Gul Aftab Ahmed, James Patten, Guoxian Lu, Wei Hou, David Gregg, J. Buckley


Citations: 1

Abstract

Code clones can detrimentally impact software maintenance, and manually detecting them in very large codebases is impractical. Additionally, automated approaches find the detection of Type 3 and Type 4 (inexact) clones very challenging. While the most recent deep neural networks (for example, BERT-based networks) seem to be highly effective in detecting such clones, their pairwise comparison of every code pair in the target system(s) is inefficient and scales poorly on large codebases. We therefore introduce SSCD, a BERT-based clone detection approach that targets high recall of Type 3 and Type 4 clones at scale (in line with our industrial partner’s requirements). It does so by computing a representative embedding for each code fragment and finding similar fragments using a nearest-neighbour search. SSCD thus avoids the pairwise-comparison bottleneck of other neural network approaches while also using parallel, GPU-accelerated search to tackle scalability. This paper details the approach and an empirical assessment aimed at configuring and evaluating that approach in an industrial setting. The configuration analysis suggests that shorter input lengths and text-only neural network models demonstrate better efficiency in SSCD, while only slightly decreasing effectiveness. The evaluation results suggest that SSCD is more effective than state-of-the-art approaches like SAGA and SourcererCC. It is also highly efficient: in its optimal setting, SSCD effectively locates clones in the entire 320 million LOC BigCloneBench (a standard clone detection benchmark) in just under three hours.
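The embed-then-search idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `embed` function below is a hypothetical stand-in that uses normalised token counts instead of a BERT model, and the search is an exhaustive CPU scan rather than a parallel, GPU-accelerated one. The sketch only shows the general shape of the approach, in which each fragment is encoded once and candidate clones are then found by similarity between embeddings, rather than by feeding every pair of fragments through a model.

```python
import math
import re
from collections import Counter


def embed(fragment):
    """Hypothetical stand-in for a BERT encoder: L2-normalised token counts."""
    counts = Counter(re.findall(r"[A-Za-z_]\w*|\S", fragment))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {token: c / norm for token, c in counts.items()}


def cosine(u, v):
    """Dot product of two sparse unit vectors, i.e. their cosine similarity."""
    return sum(weight * v.get(token, 0.0) for token, weight in u.items())


def nearest_neighbours(embeddings, query_idx, k=1):
    """Return (index, similarity) of the k fragments closest to the query."""
    query = embeddings[query_idx]
    scored = [
        (cosine(query, emb), i)
        for i, emb in enumerate(embeddings)
        if i != query_idx
    ]
    scored.sort(reverse=True)
    return [(i, score) for score, i in scored[:k]]


# Toy corpus (assumed for illustration): two renamed variants and one unrelated method.
fragments = [
    "int sum(int[] a) { int s = 0; for (int x : a) s += x; return s; }",
    "int total(int[] v) { int t = 0; for (int e : v) { t += e; } return t; }",
    'String greet(String name) { return "Hello, " + name; }',
]

embeddings = [embed(f) for f in fragments]  # one encoder call per fragment
best = nearest_neighbours(embeddings, query_idx=0, k=1)
print(best)
```

In this toy corpus the renamed summation method is returned as the nearest neighbour of the first fragment. A token-count embedding only captures lexical overlap, so it says nothing about how well a learned embedding handles Type 4 clones; that is the part the paper evaluates.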


