A Fault Tolerant Approach for Malicious URL Filtering

2018 International Symposium on Networks, Computers and Communications (ISNCC). Pub Date: 2018-06-01. DOI: 10.1109/ISNCC.2018.8530984

Mansoor Ahmed, Abid Khan, Osama Saleem, Muhammad Haris


Citations: 4

Abstract

Existing URL filtering mechanisms lack support for real-time fault tolerance and scalability. This paper addresses these issues by developing a scalable, real-time and fault-tolerant model for classifying streams of URL traffic. The key feature of the model is that it saves computation time, resource usage and bandwidth. The model is implemented in Apache Spark, which provides APIs for machine learning and streaming. The dataset consists of 2.4 million URLs drawn from both clean and malicious classes. In the training set, clean URLs are labeled 1 and malicious URLs are labeled 0. Distributed in-memory computation is provided in a fault-tolerant manner by Apache Spark's resilient distributed datasets (RDDs). By increasing the number of nodes in the cluster, we achieved linear scalability. Our model attained an accuracy of 96% with a logistic regression classifier and scaled well with the Apache Spark cluster. Using the logistic regression classifier from Spark MLlib, 2 million URLs can be filtered in 55 seconds. The model achieved F1-score values of 0.92, 0.95 and 0.93 along with precision, and the results were evaluated using cross-validation schemes.
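To make the reported quantities concrete, the following is a minimal sketch, in plain Python rather than Spark, of how precision, recall and F1-score are computed under the abstract's label convention (clean = 1, malicious = 0), and of the throughput implied by filtering 2 million URLs in 55 seconds. The function names, the choice of the malicious class as the "positive" class, and the toy label vectors are assumptions for illustration only; the abstract does not state which class the reported scores refer to, and this is not the authors' implementation.

```python
from typing import Sequence, Tuple

# Label convention stated in the abstract: clean = 1, malicious = 0.
CLEAN, MALICIOUS = 1, 0


def precision_recall_f1(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    positive: int = MALICIOUS,  # assumption: score the malicious class
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def throughput(num_urls: int, seconds: float) -> float:
    """URLs filtered per second."""
    return num_urls / seconds


# Hypothetical toy labels (not data from the paper).
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)

# Throughput implied by the abstract's figures: 2 million URLs in 55 seconds.
urls_per_second = throughput(2_000_000, 55)
```

Under these figures the classifier processes roughly 36,000 URLs per second across the cluster. Passing `positive=CLEAN` scores the clean class instead.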



