Development and Benchmarking of Multilingual Code Clone Detector
Wenqing Zhu, Norihiro Yoshida, Toshihiro Kamiya, Eunjong Choi, Hiroaki Takada
arXiv - CS - Software Engineering, 2024-09-10
DOI: arxiv-2409.06176 (https://doi.org/arxiv-2409.06176)
Citations: 0
Abstract
The diversity of programming languages is growing, making the language
extensibility of code clone detectors crucial. However, extending most existing
clone detectors is challenging because the source code handler must be
modified, which requires specialist-level knowledge of the target language and
is time-consuming. Multilingual code clone detectors make it easier to add
support for a new language by requiring only the syntax information of the
target language. To address the shortcomings of existing multilingual
detectors in language scalability and detection performance, we propose a
multilingual code block extraction method based on ANTLR parser generation, and
implement a multilingual code clone detector (MSCCD), which supports the
largest number of languages among currently available detectors and can
detect Type-3 code clones. We follow the methodology of previous studies to
evaluate its detection performance on Java. Compared with ten
state-of-the-art detectors, MSCCD performs at an average level while
supporting a significantly larger number of languages. Furthermore, we propose
the first multilingual syntactic code clone evaluation benchmark based on the
CodeNet database. Our results reveal that even when applying the same detection
approach, performance can vary markedly depending on the language of the source
code under investigation. Overall, MSCCD is the most balanced of the
evaluated tools when both detection performance and language extensibility
are considered.
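
The abstract does not describe how MSCCD compares code blocks, so the following is only a minimal sketch of what Type-3 detection involves: recognizing two fragments as clones even though statements were added or removed. It assumes a simple token-bag overlap measure; the regex tokenizer, the function names, and the 0.7 threshold are hypothetical choices for illustration, not MSCCD's actual implementation, which extracts code blocks with ANTLR-generated parsers.

```python
import re
from collections import Counter


def token_bag(code: str) -> Counter:
    """Split source text into a multiset of identifier, number, and symbol tokens.

    Assumption: a regex stands in for a real, grammar-based lexer.
    """
    return Counter(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code))


def overlap_similarity(a: Counter, b: Counter) -> float:
    """Number of shared tokens divided by the size of the larger bag (0.0 to 1.0)."""
    larger = max(sum(a.values()), sum(b.values()))
    if larger == 0:
        return 0.0
    shared = sum((a & b).values())  # multiset intersection
    return shared / larger


def is_clone_candidate(x: str, y: str, threshold: float = 0.7) -> bool:
    """Report a pair as a clone candidate when similarity reaches the threshold.

    Assumption: 0.7 is an arbitrary illustrative threshold.
    """
    return overlap_similarity(token_bag(x), token_bag(y)) >= threshold


# A Java method, a copy with one inserted statement, and an unrelated method.
original = "int sum(int[] xs) { int s = 0; for (int x : xs) { s += x; } return s; }"
edited = "int sum(int[] xs) { int s = 0; for (int x : xs) { s += x; } log(s); return s; }"
unrelated = 'String greet(String name) { return "Hello, " + name; }'

if __name__ == "__main__":
    print(is_clone_candidate(original, edited))
    print(is_clone_candidate(original, unrelated))
```

Because the measure works on token multisets rather than exact text, the inserted `log(s);` statement lowers the similarity only slightly, while the unrelated method shares little beyond punctuation. Since nothing in the comparison depends on Java, only the tokenizer would need to change for another language.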