Towards Improving Code Stylometry Analysis in Underground Forums

Proceedings on Privacy Enhancing Technologies (Privacy Enhancing Technologies Symposium). Pub Date: 2021-11-20. DOI: 10.2478/popets-2022-0007

Michal Tereszkowski-Kaminski, S. Pastrana, Jorge Blasco, Guillermo Suarez-Tangil

Volume 2022 (1), pages 126-147.

Citations: 3

Abstract

Code stylometry has emerged as a powerful mechanism for identifying programmers. While there have been significant advances in the field, existing mechanisms underperform in challenging domains. One such domain is studying the provenance of code shared in underground forums, where posts tend to contain small or incomplete source code fragments. This paper proposes a method designed to deal with the idiosyncrasies of code snippets shared in these forums. As a novelty, our system fuses a forum-specific learning pipeline with Conformal Prediction to generate predictions with precise confidence levels. We see that identifying unreliable code snippets is paramount to generating high-accuracy predictions, and that this is a task where traditional learning settings fail. Overall, our method performs twice as well as the state of the art in a constrained setting with a large number of authors (i.e., 100). When dealing with a smaller number of authors (i.e., 20), it achieves high accuracy (89%). We also evaluate our work under an open-world assumption and see that our method is more effective at retaining samples.
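The abstract does not spell out how Conformal Prediction is wired into the pipeline, so the following is only a minimal sketch of generic split (inductive) conformal prediction, not the authors' implementation. It assumes some underlying classifier already yields per-author probabilities for a snippet, and it uses `1 - probability` as the nonconformity score; that score, the function names, the author labels, and all numbers are hypothetical choices made for illustration.

```python
from typing import Dict, List, Optional, Tuple


def p_value(cal_scores: List[float], score: float) -> float:
    """Share of calibration scores at least as nonconforming as `score`.

    The +1 terms count the test sample itself, as is standard in
    split conformal prediction.
    """
    at_least = sum(1 for s in cal_scores if s >= score)
    return (at_least + 1) / (len(cal_scores) + 1)


def conformal_predict(
    cal_scores: List[float],
    probs: Dict[str, float],
    epsilon: float,
) -> Tuple[Optional[str], float, float]:
    """Return (label or None, credibility, confidence) for one snippet.

    cal_scores: nonconformity scores (assumed here: 1 - probability of the
                true author) computed on a held-out calibration set.
    probs:      hypothetical classifier probabilities per candidate author.
    epsilon:    significance level; labels with p-value <= epsilon are dropped.
    """
    p_vals = {author: p_value(cal_scores, 1.0 - p) for author, p in probs.items()}
    ranked = sorted(p_vals.values(), reverse=True)
    best = max(p_vals, key=p_vals.get)
    credibility = ranked[0]                      # largest p-value
    confidence = 1.0 - (ranked[1] if len(ranked) > 1 else 0.0)
    prediction_set = [a for a, pv in p_vals.items() if pv > epsilon]
    # Treat the snippet as unreliable unless exactly one author survives.
    label = best if prediction_set == [best] else None
    return label, credibility, confidence


# Hypothetical calibration scores and two example snippets.
calibration = [0.05, 0.12, 0.15, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
clear_snippet = {"alice": 0.9, "bob": 0.07, "carol": 0.03}
ambiguous_snippet = {"alice": 0.45, "bob": 0.35, "carol": 0.20}

clear_result = conformal_predict(calibration, clear_snippet, epsilon=0.15)
ambiguous_result = conformal_predict(calibration, ambiguous_snippet, epsilon=0.15)
print(clear_result)
print(ambiguous_result)
```

In this sketch the first snippet keeps a single author at the chosen significance level and is attributed, while the second keeps two candidates and is rejected as unreliable. Filtering of this kind is one plausible reading of why discarding unreliable snippets can raise accuracy on the samples that are retained.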


