Classifying UNSW-NB15 Network Traffic in the Big Data Framework using Random Forest in Spark
International Journal of Big Data Intelligence and Applications
Publication date: 2021-01-01
DOI: https://doi.org/10.4018/ijbdia.287617
Citation count: 5
Abstract
This work detects and classifies attacks in network traffic using Random Forest, as both a binary and a multi-class machine learning classifier, in a distributed Big Data environment built on Apache Spark. The classifiers are tested on the UNSW-NB15 dataset. Major problems in datasets of this type include high dimensionality and imbalanced data. To address high dimensionality, both Information Gain and Principal Components Analysis (PCA) were applied before training and testing Random Forest in Apache Spark. The binary and multi-class Random Forest classifiers were compared in a distributed environment, with and without PCA, across varying numbers of Spark cores and Random Forest trees, in terms of run time and statistical measures. The binary classifier achieved the highest accuracy, 99.94%, using 8 cores and 30 trees. This study obtained higher accuracy and lower false alarm rates (FAR) than previously achieved, with low testing times.
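
The abstract does not say how Information Gain was computed, so the following is only a minimal sketch of the general technique in plain Python, not the authors' Spark implementation. It scores each feature by how much knowing its value reduces the entropy of the class label, then ranks features so that low-scoring ones could be dropped. The helper names (`entropy`, `information_gain`, `rank_features`) and the toy records are hypothetical, and the sketch assumes categorical features; continuous features would first need to be discretized.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """Label entropy minus the entropy left after grouping by feature value."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average of the label entropy inside each feature-value group.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(columns, labels):
    """Return (feature name, gain) pairs sorted by descending gain."""
    gains = {name: information_gain(values, labels)
             for name, values in columns.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Toy, made-up records (assumption: not taken from UNSW-NB15).
labels = ["attack", "attack", "normal", "normal"]
columns = {
    # Separates the two classes perfectly, so its gain is 1 bit.
    "feature_a": ["udp", "udp", "tcp", "tcp"],
    # Each value is split evenly between the classes, so its gain is 0.
    "feature_b": ["x", "y", "x", "y"],
}

ranking = rank_features(columns, labels)
for name, gain in ranking:
    print(f"{name}: {gain:.3f}")
```

In this toy case `feature_a` ranks first because it removes all uncertainty about the label, while `feature_b` carries no information about the class and would be a candidate for removal.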