Supervised Authorship Segmentation of Open Source Code Projects

Proceedings on Privacy Enhancing Technologies. Privacy Enhancing Technologies Symposium. Pub Date: 2021-07-23. DOI: 10.2478/popets-2021-0080

Edwin Dauber, R. Erbacher, Gregory G. Shearer, Mike Weisman, Frederica Free-Nelson, R. Greenstadt

Volume: 2021 1. Pages: 464-479. Publication type: Journal Article.

Citations: 1

Abstract

Source code authorship attribution can be used for many types of intelligence on binaries and executables, including forensics, but it also threatens the privacy of anonymous programmers. Previous work has shown how to attribute individually authored code files and code segments. In this work, we examine authorship segmentation, in which we determine the authorship of arbitrary parts of a program. While previous work has performed segmentation at the textual level, we attempt to attribute subtrees of the abstract syntax tree (AST). We focus on two primary problems: identifying the primary author of an arbitrary AST subtree, and identifying the edges of the AST on which primary authorship changes. We demonstrate that the former is a difficult problem but the latter is much easier, and we demonstrate methods that leverage the easier problem to improve accuracy on the harder one. We show that although identifying the author of subtrees is difficult overall, this is primarily due to the abundance of small subtrees: in the validation set we can attribute subtrees of at least 25 nodes with accuracy over 80% and subtrees of at least 33 nodes with accuracy over 90%, while in the test set we can attribute subtrees of at least 33 nodes with accuracy of 70%. Our baseline accuracy for single AST nodes is 20.21% on the validation set and 35.66% on the test set; we present techniques that increase this accuracy to 42.01% and 49.21%, respectively. We further present observations about collaborative code found on GitHub that may drive further research.
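To make the two problems concrete, the following is a minimal sketch of how ground-truth labels for them could be derived on a toy tree. It is not the method of the paper: the `Node` class, the per-node `author` field, and the rule that the primary author is whoever wrote the most nodes in a subtree (ties broken alphabetically) are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Toy AST node; `author` is a hypothetical per-node ground-truth label."""
    label: str
    author: str
    children: List["Node"] = field(default_factory=list)


def annotate(node, info):
    """Post-order walk recording (subtree size, primary author) per node.

    Assumption: the primary author of a subtree is the author of most of
    its nodes, with ties broken alphabetically. Returns the author counts
    for the subtree rooted at `node`.
    """
    counts = Counter({node.author: 1})
    for child in node.children:
        counts += annotate(child, info)
    primary = min(counts, key=lambda a: (-counts[a], a))
    info[id(node)] = (sum(counts.values()), primary)
    return counts


def change_edges(node, info):
    """Yield (parent label, child label) for each edge across which the
    primary author of the parent subtree differs from the child subtree."""
    for child in node.children:
        if info[id(node)][1] != info[id(child)][1]:
            yield (node.label, child.label)
        yield from change_edges(child, info)


# Hypothetical example: alice wrote the signature, bob wrote the body.
body = Node("Body", "bob", [Node("Assign", "bob"), Node("Return", "bob")])
tree = Node("FunctionDef", "alice", [Node("Args", "alice"), body])

info = {}
annotate(tree, info)
root_size, root_author = info[id(tree)]
edges = list(change_edges(tree, info))
print(root_size, root_author, edges)
```

Under this assumed definition, the author of a root node need not be the primary author of its subtree, and a single-node subtree carries only one node of evidence, which is consistent with the observation that small subtrees are the hard case.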
