Pengnan Hao, Zhuguo Li, Cui Liu, Yu Wen, Fanming Liu
{"title":"Towards Improving Multiple Authorship Attribution of Source Code","authors":"Pengnan Hao, Zhuguo Li, Cui Liu, Yu Wen, Fanming Liu","doi":"10.1109/QRS57517.2022.00059","DOIUrl":null,"url":null,"abstract":"Source code authorship attribution addresses the problems of copyright infringement disputes and plagiarism detection. However, most software projects are collaborative development projects. It is necessary to study multiple authorship attribution. Existing methods are not reliable in the domain of multiple authorship attribution. The reasons are as follows: i) It is a challenge to divide the code boundaries of different authors in a sample; ii) code segments belonging to different authors in a sample are usually small or incomplete. This paper proposes a method to address these challenges. We first divide the code sample into multiple lines, then integrate the code lines with similar author styles into code segments using Siamese networks. Finally, we use a path-based code representation and machine learning to identify authors. Experimental results show the method achieves an accuracy of 87.35% on C/C++ dataset and 91.35% on Java dataset, which performs better than existing methods.","PeriodicalId":143812,"journal":{"name":"2022 IEEE 22nd International Conference on Software Quality, Reliability and Security (QRS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE 22nd International Conference on Software Quality, Reliability and Security (QRS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/QRS57517.2022.00059","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Source code authorship attribution addresses the problems of copyright infringement disputes and plagiarism detection. However, most software projects are collaborative development projects. It is necessary to study multiple authorship attribution. Existing methods are not reliable in the domain of multiple authorship attribution. The reasons are as follows: i) It is a challenge to divide the code boundaries of different authors in a sample; ii) code segments belonging to different authors in a sample are usually small or incomplete. This paper proposes a method to address these challenges. We first divide the code sample into multiple lines, then integrate the code lines with similar author styles into code segments using Siamese networks. Finally, we use a path-based code representation and machine learning to identify authors. Experimental results show the method achieves an accuracy of 87.35% on C/C++ dataset and 91.35% on Java dataset, which performs better than existing methods.