Automatically Labeling Cyber Threat Intelligence reports using Natural Language Processing

Proceedings of the ACM Symposium on Document Engineering 2023. Publication date: 2023-08-22. DOI: 10.1145/3573128.3609348

Hamza Abdi, S. Bagley, S. Furnell, J. Twycross

Abstract

Attribution provides valuable intelligence in the face of Advanced Persistent Threat (APT) attacks. By accurately identifying the actors behind an attack, we can gain insight into their motivations, capabilities, and potential future targets. Effective attribution relies on Cyber Threat Intelligence (CTI) reports, which are compiled by security experts and provide valuable information about threat actors and their attacks. We are interested in building a fully automated APT attribution framework. An essential step in doing so is the automated processing and extraction of information from CTI reports. However, CTI reports are largely unstructured, which makes extracting and analyzing this information difficult. As a first step, we introduce a method for automatically labeling a CTI report with the main threat actor attributed within the report. This is done using a custom Natural Language Processing (NLP) model based on the spaCy library. We also report on the performance and effectiveness of the various PDF-to-text Python libraries used in this work. To evaluate the effectiveness of our model, we experimented on a dataset of 605 English documents, which were randomly collected from various sources on the internet and manually labeled. Our method achieved an accuracy of 97%. Finally, we discuss the challenges associated with processing these documents automatically and propose some methods for tackling them.
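The labeling task described in the abstract can be illustrated with a small sketch. This is not the authors' spaCy-based model: it is a simplified, standard-library stand-in that assumes the main threat actor is the one mentioned most often in the report text. The alias table `ACTOR_ALIASES`, the function names, and the sample report are all hypothetical and chosen only for illustration. The `accuracy` helper shows how a score such as the reported 97% would be computed against manual labels.

```python
import re
from collections import Counter

# Hypothetical alias table (an assumption, not from the paper):
# canonical actor name -> names under which the actor may appear in a report.
ACTOR_ALIASES = {
    "APT28": ["APT28", "Fancy Bear", "Sofacy"],
    "APT29": ["APT29", "Cozy Bear"],
    "Lazarus Group": ["Lazarus Group", "Lazarus", "Hidden Cobra"],
}


def count_actor_mentions(text):
    """Count case-insensitive, whole-word alias matches per canonical actor."""
    counts = Counter()
    for actor, aliases in ACTOR_ALIASES.items():
        # Try longer aliases first so "Lazarus Group" is matched as one
        # mention rather than being split into a shorter alias.
        ordered = sorted(aliases, key=len, reverse=True)
        pattern = r"\b(?:" + "|".join(re.escape(a) for a in ordered) + r")\b"
        hits = re.findall(pattern, text, flags=re.IGNORECASE)
        if hits:
            counts[actor] = len(hits)
    return counts


def label_report(text):
    """Return the most frequently mentioned actor, or None if none is found."""
    counts = count_actor_mentions(text)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def accuracy(predicted, gold):
    """Fraction of reports whose predicted label equals the manual label."""
    if not gold:
        raise ValueError("gold labels must not be empty")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Made-up report text, used only to exercise the sketch.
sample_report = (
    "The intrusion set overlaps with Fancy Bear tooling. "
    "APT28 has used this loader before, and Sofacy infrastructure was reused. "
    "An unrelated Cozy Bear campaign is mentioned for context."
)
print(label_report(sample_report))  # prints: APT28
```

A frequency heuristic like this would likely mislabel reports that mention several actors for comparison, which is one reason a trained NLP model is the more plausible choice for real CTI reports.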
