A systematic literature review of cyber-security data repositories and performance assessment metrics for semi-supervised learning.

Discover data. Pub Date: 2023-01-01. DOI: 10.1007/s44248-023-00003-x

Paul K Mvula, Paula Branco, Guy-Vincent Jourdan, Herna L Viktor

Discover data, vol. 1, no. 1, p. 4 (2023). Journal Article. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10079755/pdf/

Citations: 1

Abstract

In Machine Learning, the datasets used to build models are one of the main factors limiting what these models can achieve and how good their predictive performance is. Machine Learning applications for cyber-security, or computer security, are numerous. They include cyber threat mitigation and security infrastructure enhancement through pattern recognition, real-time attack detection, and in-depth penetration testing. For these applications in particular, the datasets used to build the models must therefore be carefully designed to be representative of real-world data. However, because labelled data is scarce and manually labelling positive examples is costly, a growing body of literature applies Semi-Supervised Learning to cyber-security data repositories. In this work, we provide a comprehensive overview of publicly available data repositories and datasets used for building computer security or cyber-security systems based on Semi-Supervised Learning, where only a few labels are necessary or available for building strong models. We highlight the strengths and limitations of these data repositories and datasets and analyse the performance assessment metrics used to evaluate the resulting models. Finally, we discuss open challenges and provide future research directions for using cyber-security datasets and evaluating models built upon them.


