LSPR23: A novel IDS dataset from the largest live-fire cybersecurity exercise
Allard Dijk, Emre Halisdemir, Cosimo Melella, Alari Schu, Mauno Pihelgas, Roland Meier
Journal of Information Security and Applications, volume 85, Article 103847. Published 2024-08-06. DOI: 10.1016/j.jisa.2024.103847
https://www.sciencedirect.com/science/article/pii/S2214212624001492
Abstract
Cybersecurity threats are constantly evolving and becoming increasingly sophisticated, automated, adaptive, and intelligent. This makes it difficult for organizations to defend their digital assets. Industry professionals are looking for solutions to improve the efficiency and effectiveness of cybersecurity operations, adopting different strategies. In cybersecurity, the importance of developing new intrusion detection systems (IDSs) to address these threats has emerged. Most of these systems today are based on machine learning. But these systems need high-quality data to “learn” the characteristics of malicious traffic. Such datasets are difficult to obtain and therefore rarely available.
This paper advances the state of the art and presents a new high-quality IDS dataset. The dataset originates from Locked Shields, one of the world’s most extensive live-fire cyber defense exercises. This ensures that (i) it contains realistic behavior of attackers and defenders; (ii) it contains sophisticated attacks; and (iii) it contains labels, as the actions of the attackers are well-documented.
The dataset includes approximately 16 million network flows, of which approximately 1.6 million were labeled malicious. What is unique about this dataset is the use of a new labeling technique that increases the accuracy level of data labeling.
We evaluate the robustness of our dataset using both quantitative and qualitative methodologies. We begin with a quantitative examination of the Suricata IDS alerts based on signatures and anomalies. Subsequently, we assess the reproducibility of machine learning experiments conducted by Känzig et al., who used a private Locked Shields dataset. We also apply the quality criteria outlined by the evaluation framework proposed by Gharib et al.
Using our dataset with an existing classifier, we obtain results (F1 score of 0.997) comparable to those of the original paper, where the classifier was evaluated on a private dataset (F1 score of 0.984).
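For readers unfamiliar with the metric, the F1 score quoted above is the harmonic mean of precision and recall. The following is a minimal sketch of that calculation, assuming a binary setting with "malicious" as the positive class; the function name and the confusion-matrix counts are hypothetical and chosen only for illustration, not taken from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return the F1 score (harmonic mean of precision and recall).

    tp: malicious flows correctly flagged
    fp: benign flows wrongly flagged as malicious
    fn: malicious flows that were missed
    """
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only; these are NOT results from the paper.
example = f1_score(tp=990, fp=5, fn=10)
print(f"F1 = {example:.4f}")
```

Because F1 ignores true negatives, it is less distorted than plain accuracy when benign flows far outnumber malicious ones, as is the case in this dataset.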
About the journal:
Journal of Information Security and Applications (JISA) focuses on the original research and practice-driven applications with relevance to information security and applications. JISA provides a common linkage between a vibrant scientific and research community and industry professionals by offering a clear view on modern problems and challenges in information security, as well as identifying promising scientific and "best-practice" solutions. JISA issues offer a balance between original research work and innovative industrial approaches by internationally renowned information security experts and researchers.