Automatically Labeling Cyber Threat Intelligence reports using Natural Language Processing

Hamza Abdi, S. Bagley, S. Furnell, J. Twycross
{"title":"Automatically Labeling Cyber Threat Intelligence reports using Natural Language Processing","authors":"Hamza Abdi, S. Bagley, S. Furnell, J. Twycross","doi":"10.1145/3573128.3609348","DOIUrl":null,"url":null,"abstract":"Attribution provides valuable intelligence in the face of Advanced Persistent Threat (APT) attacks. By accurately identifying the culprits and actors behind the attacks, we can gain more insights into their motivations, capabilities, and potential future targets. Cyber Threat Intelligence (CTI) reports are relied upon to attribute these attacks effectively. These reports are compiled by security experts and provide valuable information about threat actors and their attacks. We are interested in building a fully automated APT attribution framework. An essential step in doing so is the automated processing and extraction of information from CTI reports. However, CTI reports are largely unstructured, making extraction and analysis of the information a difficult task. To begin this work, we introduce a method for automatically highlighting a CTI report with the main threat actor attributed within the report. This is done using a custom Natural Language Processing (NLP) model based on the spaCy library. Also, the study showcases and highlights the performance and effectiveness of various pdf-to-text Python libraries that were used in this work. Additionally, to evaluate the effectiveness of our model, we experimented on a dataset consisting of 605 English documents, which were randomly collected from various sources on the internet and manually labeled. Our method achieved an accuracy of 97%. Finally, we discuss the challenges associated with processing these documents automatically and propose some methods for tackling them.","PeriodicalId":310776,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2023","volume":"49 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACM Symposium on Document Engineering 2023","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3573128.3609348","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Attribution provides valuable intelligence in the face of Advanced Persistent Threat (APT) attacks. By accurately identifying the culprits and actors behind the attacks, we can gain more insights into their motivations, capabilities, and potential future targets. Cyber Threat Intelligence (CTI) reports are relied upon to attribute these attacks effectively. These reports are compiled by security experts and provide valuable information about threat actors and their attacks. We are interested in building a fully automated APT attribution framework. An essential step in doing so is the automated processing and extraction of information from CTI reports. However, CTI reports are largely unstructured, making extraction and analysis of the information a difficult task. To begin this work, we introduce a method for automatically highlighting a CTI report with the main threat actor attributed within the report. This is done using a custom Natural Language Processing (NLP) model based on the spaCy library. Also, the study showcases and highlights the performance and effectiveness of various pdf-to-text Python libraries that were used in this work. Additionally, to evaluate the effectiveness of our model, we experimented on a dataset consisting of 605 English documents, which were randomly collected from various sources on the internet and manually labeled. Our method achieved an accuracy of 97%. Finally, we discuss the challenges associated with processing these documents automatically and propose some methods for tackling them.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
使用自然语言处理自动标记网络威胁情报报告
在面对高级持续性威胁(APT)攻击时,归因提供了有价值的情报。通过准确识别攻击背后的罪魁祸首和参与者,我们可以更深入地了解他们的动机、能力和潜在的未来目标。网络威胁情报(CTI)报告可以有效地归因于这些攻击。这些报告由安全专家编写,并提供有关威胁参与者及其攻击的宝贵信息。我们有兴趣建立一个完全自动化的APT归因框架。这样做的一个重要步骤是从CTI报告中自动处理和提取信息。然而,CTI报告在很大程度上是非结构化的,使得信息的提取和分析成为一项困难的任务。为了开始这项工作,我们引入了一种方法,用于自动突出显示CTI报告,其中包含报告中属性的主要威胁参与者。这是使用基于spaCy库的自定义自然语言处理(NLP)模型完成的。此外,该研究还展示并强调了在这项工作中使用的各种pdf-to-text Python库的性能和有效性。此外,为了评估我们模型的有效性,我们在一个由605个英文文档组成的数据集上进行了实验,这些文档从互联网上的各种来源随机收集并手动标记。我们的方法达到了97%的准确率。最后,我们讨论了与自动处理这些文档相关的挑战,并提出了一些解决这些挑战的方法。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
OntG-Bart: Ontology-Infused Clinical Abstractive Summarization Deep-learning for dysgraphia detection in children handwritings Addressing the gap between current language models and key-term-based clustering YinYang, a Fast and Robust Adaptive Document Image Binarization for Optical Character Recognition Automatically Labeling Cyber Threat Intelligence reports using Natural Language Processing
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1