An intelligent identification and classification system for malicious uniform resource locators (URLs).

IF 4.5; Zone 3 (Computer Science); JCR Q2 (Computer Science, Artificial Intelligence); Neural Computing & Applications; Pub Date: 2023-04-20; DOI: 10.1007/s00521-023-08592-z

Qasem Abu Al-Haija, Mustafa Al-Fayoumi

Journal Article, Neural Computing & Applications, pp. 1-17. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10117275/pdf/

Citations: 5

Abstract

A Uniform Resource Locator (URL) is a unique identifier, composed of a protocol and a domain name, used to locate and retrieve a resource on the Internet. Like any Internet service, URLs (also called websites) can be compromised by attackers to create malicious URLs that exploit or destroy the user's information and resources. Malicious URLs are usually designed to promote cyber-attacks such as spam, phishing, malware, and defacement. These websites usually require some action from the user and can reach users through emails, text messages, pop-ups, or deceptive advertisements. In some cases they can compromise the user's machine or network, especially when they arrive by email. Developing systems to detect malicious URLs is therefore of great current interest. This paper proposes a high-performance machine-learning-based detection system to identify malicious URLs. The proposed system provides two layers of detection. First, a binary classifier identifies each URL as either benign or malicious. Second, a multi-class classifier assigns each URL, based on its features, to one of five classes: benign, spam, phishing, malware, or defacement. Specifically, we report on four ensemble learning approaches: the ensemble of bagging trees (En_Bag), the ensemble of k-nearest neighbors (En_kNN), the ensemble of boosted decision trees (En_Bos), and the ensemble of subspace discriminator (En_Dsc). The approaches were evaluated on ISCX-URL2016, an inclusive and contemporary URL dataset. ISCX-URL2016 provides a lightweight dataset for detecting and categorizing malicious URLs according to their attack type and lexical analysis. Conventional machine learning evaluation measures are used: detection accuracy, precision, recall, F-score, and detection time. Our experimental assessment indicates that the En_Bag approach performs better than the other ensemble methods, while the En_kNN approach provides the highest inference speed. We also compare our En_Bag model with state-of-the-art solutions and show its superiority in binary and multi-class classification, with accuracy rates of 99.3% and 97.92%, respectively.
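The two-layer idea and the bagging ensemble described in the abstract can be sketched in a few lines of dependency-free Python. This is only an illustration under stated assumptions, not the authors' implementation: the base learner is a 1-nearest-neighbour rule standing in for the decision trees of En_Bag, the two features (URL length, digit count) and all numeric values are invented rather than taken from ISCX-URL2016, and the abstract does not say whether the second layer is consulted only for URLs flagged by the first, which is assumed here. All function names are hypothetical.

```python
import random
from collections import Counter

def nearest_label(sample, x):
    """Base learner (stand-in for a decision tree): label of the closest row."""
    return min(sample, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)))[1]

def fit_bagging(rows, n_estimators=25, seed=0):
    """Bagging: draw one bootstrap sample (with replacement) per ensemble member."""
    rng = random.Random(seed)
    return [[rng.choice(rows) for _ in rows] for _ in range(n_estimators)]

def predict_bagging(ensemble, x):
    """Aggregate the members' individual predictions by majority vote."""
    votes = Counter(nearest_label(sample, x) for sample in ensemble)
    return votes.most_common(1)[0][0]

def classify_url(x, layer1, layer2):
    """Layer 1: benign vs. malicious. Layer 2 (assumed chained): five-class label."""
    if predict_bagging(layer1, x) == "benign":
        return "benign"
    return predict_bagging(layer2, x)

# Invented toy clusters of [url_length, digit_count]; not real dataset values.
CENTERS = {
    "benign": (20, 1), "spam": (60, 5), "phishing": (60, 30),
    "malware": (120, 5), "defacement": (120, 30),
}
OFFSETS = [(-2, -1), (2, 1), (-1, 1), (1, -1)]
rows = [((cx + dx, cy + dy), label)
        for label, (cx, cy) in CENTERS.items() for dx, dy in OFFSETS]

def to_binary(label):
    """Collapse the four attack types into a single 'malicious' class."""
    return "benign" if label == "benign" else "malicious"

layer1 = fit_bagging([(x, to_binary(y)) for x, y in rows], seed=0)  # binary detector
layer2 = fit_bagging(rows, seed=1)                                  # five-class model

for query in [(21, 1), (61, 29), (119, 6)]:
    print(query, "->", classify_url(query, layer1, layer2))
```

Chaining the layers means the cheaper binary decision filters out benign URLs before the five-class model runs; running both classifiers independently on every URL would be an equally plausible reading of the abstract.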


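Source journal: Neural Computing & Applications (Engineering & Technology, Computer Science: Artificial Intelligence)
CiteScore: 11.40
Self-citation rate: 8.30%
Articles published: 1280
Review time: 6.9 months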

Journal description: Neural Computing & Applications is an international journal that publishes original research and other information in the field of practical applications of neural computing and related techniques such as genetic algorithms, fuzzy logic, and neuro-fuzzy systems. All items relevant to building practical systems are within its scope, including but not limited to: adaptive computing, algorithms, applicable neural networks theory, applied statistics, architectures, artificial intelligence, benchmarks, case histories of innovative applications, fuzzy logic, genetic algorithms, hardware implementations, hybrid intelligent systems, intelligent agents, intelligent control systems, intelligent diagnostics, intelligent forecasting, machine learning, neural networks, neuro-fuzzy systems, pattern recognition, performance measures, self-learning systems, software simulations, supervised and unsupervised learning methods, and system engineering and integration. Featured contributions fall into several categories: Original Articles, Review Articles, Book Reviews, and Announcements.
