Weighted ensemble classifier for malicious link detection using natural language processing

International Journal of Pervasive Computing and Communications (IF 0.6, Q4, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS)
Pub Date: 2023-01-03
DOI: 10.1108/ijpcc-09-2022-0312

S. A, S. Balasubaramanian, Pradeepa Ganesan, Justin Rajasekaran, K. R


Citations: 2

Abstract

Purpose

The internet has become fully integrated into contemporary life, and people depend on internet services for everyday activities. Consequently, an abundance of information about people and organizations is available online, which encourages the proliferation of cybercrime. Cybercriminals often use malicious links, disseminated via email, SMS and social media, for large-scale cyberattacks. Recognizing malicious links online can be exceedingly challenging. The purpose of this paper is to present a strong security system that can detect malicious links in cyberspace using natural language processing techniques.

Design/methodology/approach

Researchers have recommended a variety of approaches, including blacklisting and rule-based machine/deep learning, for automatically recognizing malicious links. These approaches, however, generally require a set of features to be generated in order to generalize the detection process. Most of the features are generated by processing the URL and the content of the web page, together with some external features such as the ranking of the web page and domain name system information. This process of feature extraction and selection is typically time-consuming and demands a high level of domain expertise. The generated features may also fail to exploit the full potential of the data set. In addition, the majority of currently deployed systems use a single classifier to classify malicious links, yet prediction accuracy may vary widely depending on the data set and the classifier used.

Findings

To address the issue of generating feature sets, the proposed method uses natural language processing techniques (term frequency and inverse document frequency) to vectorize URLs. To build a robust system for the classification of malicious links, the proposed system implements a weighted soft voting classifier, an ensemble classifier that combines the predictions of base classifiers. The weight assigned to each classifier is based on that classifier's skill.

Originality/value

The proposed method performs better when the optimal weights are assigned. Its performance was assessed on two different data sets (D1 and D2) and compared against base machine learning classifiers and previous research results. The accuracy results show that the proposed method is superior to the existing methods, offering 91.4% and 98.8% accuracy for data sets D1 and D2, respectively.
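The two techniques named in the abstract, TF-IDF vectorization of URLs and weighted soft voting, can be illustrated with a minimal standard-library sketch. It is not the authors' implementation: the abstract does not specify the tokenizer, the TF-IDF variant, the base classifiers or how the weights are chosen. The function names (`tokenize_url`, `tfidf_vectors`, `weighted_soft_vote`), the split on non-alphanumeric characters, the unsmoothed `log(N / df)` form of inverse document frequency, and the example probabilities and weights are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize_url(url):
    """Split a URL into lowercase alphanumeric tokens (assumed tokenizer)."""
    return [tok for tok in re.split(r"[^a-z0-9]+", url.lower()) if tok]


def tfidf_vectors(urls):
    """Return one {token: tf * idf} dict per URL.

    tf  = token count / number of tokens in that URL
    idf = log(N / df), the unsmoothed textbook form (an assumption here)
    """
    docs = [Counter(tokenize_url(u)) for u in urls]
    n_docs = len(docs)
    # Document frequency: in how many URLs each token appears.
    doc_freq = Counter(tok for doc in docs for tok in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({
            tok: (count / total) * math.log(n_docs / doc_freq[tok])
            for tok, count in doc.items()
        })
    return vectors


def weighted_soft_vote(probabilities, weights):
    """Combine class probabilities from several classifiers.

    probabilities: one row per base classifier, each row holding the
                   predicted probability of every class.
    weights:       one non-negative weight per base classifier.
    Returns (index of the winning class, weighted-average probabilities).
    """
    total_weight = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * row[c] for w, row in zip(weights, probabilities)) / total_weight
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=lambda c: averaged[c])
    return predicted, averaged


if __name__ == "__main__":
    # Hypothetical URLs, used only to show the vectorizer's output shape.
    urls = ["http://example.com/login", "http://example.com/free-prize"]
    for url, vec in zip(urls, tfidf_vectors(urls)):
        print(url, vec)

    # Hypothetical outputs of three base classifiers: [P(benign), P(malicious)].
    probs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
    print(weighted_soft_vote(probs, [1, 1, 1]))  # equal weights
    print(weighted_soft_vote(probs, [6, 1, 1]))  # first classifier trusted most
```

With equal weights the two classifiers that favor the second class win the vote. Giving the first classifier a much larger weight flips the decision, which is the behavior the abstract refers to when it says performance depends on assigning suitable weights.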



Source journal

International Journal of Pervasive Computing and Communications (COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS)

CiteScore

6.60

Self-citation rate

0.00%

Number of articles published