Extending Isolation Forest for Anomaly Detection in Big Data via K-Means

ACM Transactions on Cyber-Physical Systems (TCPS). Pub Date: 2021-04-27. DOI: 10.1145/3460976

Md Tahmid Rahman Laskar, J. Huang, Vladan Smetana, Chris Stewart, Kees Pouw, Aijun An, Steve Chan, Lei Liu


Citations: 14

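Abstract

Industrial information technology infrastructures are often vulnerable to cyberattacks. Securing the computer systems in an industrial environment requires effective intrusion detection systems that monitor cyber-physical systems (e.g., computer networks) for malicious activity. This article aims to build such an intrusion detection system to protect computer networks from cyberattacks. More specifically, we propose a novel unsupervised machine learning approach that combines the K-Means algorithm with the Isolation Forest for anomaly detection in industrial big data scenarios. Because the target is a big data scenario in the industrial domain, we implement the proposed model with the Apache Spark framework and train it on large network traffic data (about 123 million instances of network traffic) stored in Elasticsearch. Moreover, we evaluate the model on live streaming data and find that the system can be used for real-time anomaly detection in an industrial setup. In addition, we address the challenges we faced while training the model on large datasets and describe explicitly how they were resolved. Based on our empirical evaluation across different anomaly detection use cases on real-world network traffic data, we observe that the proposed system is effective at detecting anomalies in big data scenarios. Finally, we evaluate the model on several academic datasets against other models and find that its performance is comparable to that of other state-of-the-art approaches.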
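The abstract does not say how K-Means and the Isolation Forest are combined, and the authors' implementation runs on Apache Spark. The following is only a minimal single-machine sketch of one plausible combination, under stated assumptions: the data is first partitioned with K-Means, a separate isolation forest is fitted per cluster, and a point is scored by the forest of its nearest centroid. All function names and parameters (`kmeans`, `fit_forest`, `fit_kmeans_iforest`, `score`, `sample_size`, and so on) are hypothetical and chosen for illustration.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def nearest(p, centroids):
    """Index of the centroid closest to point p (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, [nearest(p, centroids) for p in points]

def c_factor(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random feature at a random threshold."""
    if depth >= max_depth or len(points) <= 1:
        return ("leaf", len(points))
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # all values equal on this feature: stop splitting
        return ("leaf", len(points))
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return ("node", dim, split,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))

def path_length(tree, p, depth=0):
    """Depth at which p lands, plus the usual correction for leaf size."""
    if tree[0] == "leaf":
        return depth + c_factor(tree[1])
    _, dim, split, left, right = tree
    return path_length(left if p[dim] < split else right, p, depth + 1)

def fit_forest(points, n_trees=50, sample_size=64, seed=0):
    """Fit an isolation forest on subsamples of the given points."""
    rng = random.Random(seed)
    size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(max(size, 2)))
    trees = [build_tree(rng.sample(points, size), 0, max_depth, rng)
             for _ in range(n_trees)]
    return trees, size

def anomaly_score(forest, p):
    """Isolation Forest score in (0, 1]; higher means more anomalous."""
    trees, size = forest
    if size < 2:  # assumption: a singleton cluster is maximally anomalous
        return 1.0
    mean_path = sum(path_length(t, p) for t in trees) / len(trees)
    return 2.0 ** (-mean_path / c_factor(size))

def fit_kmeans_iforest(points, k, seed=0):
    """Assumed combination: one isolation forest per K-Means cluster."""
    centroids, labels = kmeans(points, k, seed=seed)
    forests = []
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        forests.append(fit_forest(members, seed=seed + c) if members else None)
    return centroids, forests

def score(model, p):
    """Score p with the forest belonging to its nearest centroid."""
    centroids, forests = model
    forest = forests[nearest(p, centroids)]
    return 1.0 if forest is None else anomaly_score(forest, p)

if __name__ == "__main__":
    rng = random.Random(1)
    data = [(rng.gauss(m, 1.0), rng.gauss(m, 1.0)) for m in (0.0, 10.0) for _ in range(200)]
    model = fit_kmeans_iforest(data, k=2)
    print("typical point:", score(model, (10.0, 10.0)))
    print("distant point:", score(model, (20.0, 20.0)))
```

Fitting one forest per cluster keeps each forest's subsample local to a region of the data, which is one way such a hybrid could scale out, since clusters can be processed independently. Whether the paper uses this design or a different one cannot be determined from the abstract.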


