Towards Improving Code Stylometry Analysis in Underground Forums

Michal Tereszkowski-Kaminski, S. Pastrana, Jorge Blasco, Guillermo Suarez-Tangil
{"title":"Towards Improving Code Stylometry Analysis in Underground Forums","authors":"Michal Tereszkowski-Kaminski, S. Pastrana, Jorge Blasco, Guillermo Suarez-Tangil","doi":"10.2478/popets-2022-0007","DOIUrl":null,"url":null,"abstract":"Abstract Code Stylometry has emerged as a powerful mechanism to identify programmers. While there have been significant advances in the field, existing mechanisms underperform in challenging domains. One such domain is studying the provenance of code shared in underground forums, where code posts tend to have small or incomplete source code fragments. This paper proposes a method designed to deal with the idiosyncrasies of code snippets shared in these forums. Our system fuses a forum-specific learning pipeline with Conformal Prediction to generate predictions with precise confidence levels as a novelty. We see that identifying unreliable code snippets is paramount to generate high-accuracy predictions, and this is a task where traditional learning settings fail. Overall, our method performs as twice as well as the state-of-the-art in a constrained setting with a large number of authors (i.e., 100). When dealing with a smaller number of authors (i.e., 20), it performs at high accuracy (89%). We also evaluate our work on an open-world assumption and see that our method is more effective at retaining samples.","PeriodicalId":74556,"journal":{"name":"Proceedings on Privacy Enhancing Technologies. Privacy Enhancing Technologies Symposium","volume":"2022 1","pages":"126 - 147"},"PeriodicalIF":0.0000,"publicationDate":"2021-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings on Privacy Enhancing Technologies. Privacy Enhancing Technologies Symposium","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2478/popets-2022-0007","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

Abstract Code Stylometry has emerged as a powerful mechanism to identify programmers. While there have been significant advances in the field, existing mechanisms underperform in challenging domains. One such domain is studying the provenance of code shared in underground forums, where code posts tend to have small or incomplete source code fragments. This paper proposes a method designed to deal with the idiosyncrasies of code snippets shared in these forums. Our system fuses a forum-specific learning pipeline with Conformal Prediction to generate predictions with precise confidence levels as a novelty. We see that identifying unreliable code snippets is paramount to generate high-accuracy predictions, and this is a task where traditional learning settings fail. Overall, our method performs as twice as well as the state-of-the-art in a constrained setting with a large number of authors (i.e., 100). When dealing with a smaller number of authors (i.e., 20), it performs at high accuracy (89%). We also evaluate our work on an open-world assumption and see that our method is more effective at retaining samples.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
改进地下论坛中的代码样式分析
代码风格已经成为一种识别程序员的强大机制。虽然该领域已经取得了重大进展,但现有机制在具有挑战性的领域表现不佳。其中一个领域是研究地下论坛中共享的代码的来源,那里的代码帖子往往有小的或不完整的源代码片段。本文提出了一种方法来处理这些论坛中共享的代码片段的特性。我们的系统将论坛特定的学习管道与保形预测融合在一起,以产生具有精确置信度的预测。我们看到,识别不可靠的代码片段对于生成高精度预测至关重要,这是传统学习设置失败的任务。总的来说,我们的方法在有大量作者(即100人)的约束设置下的性能是最先进方法的两倍。当处理较少数量的作者(例如,20)时,它的准确率很高(89%)。我们还在开放世界假设下评估了我们的工作,并发现我们的方法在保留样本方面更有效。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
审稿时长
16 weeks
期刊最新文献
Editors' Introduction Compact and Divisible E-Cash with Threshold Issuance On the Robustness of Topics API to a Re-Identification Attack DP-SIPS: A simpler, more scalable mechanism for differentially private partition selection Privacy-Preserving Federated Recurrent Neural Networks
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1