Towards Performance of NLP Transformers on URL-Based Phishing Detection for Mobile Devices

J. Ubiquitous Syst. Pervasive Networks, Pub Date: 2022-08-01, DOI: 10.5383/juspn.17.01.005

H. Shirazi, K. Haynes, I. Ray


Citations: 2

Abstract

Hackers are increasingly launching phishing attacks via SMS and social media. Games and dating apps introduce yet another attack vector. However, current deep learning-based phishing detection applications are impractical on mobile devices because of their computational burden. We propose a lightweight phishing detection algorithm for mobile devices that distinguishes phishing from legitimate websites using URLs alone. As a performance baseline, we apply Artificial Neural Networks (ANNs) to URL-based and HTML-based website features. A model search yields 15 ANN models with accuracies above 96%, comparable to state-of-the-art approaches. Next, we test deep ANNs on URL-based features only; all of these models perform poorly, with a highest accuracy of 86.2%, indicating that URL-based features alone are not adequate to detect phishing websites even with deep ANNs. Since language transformers learn to represent context-dependent text sequences, we hypothesize that they can learn directly from the text in URLs to distinguish between legitimate and malicious websites. We apply three state-of-the-art deep transformers (BERT, ELECTRA, and RoBERTa) to phishing detection. Testing custom and standard vocabularies, we find that pre-trained transformers available for immediate use (with fine-tuning) outperform the model trained with the custom URL-based vocabulary. In addition, we test MobileBERT, a thinner BERT transformer suited to lightweight devices such as mobile phones. Our results show that the evaluation metrics of this model are competitive with those of the other models in this study, yet its testing time is significantly lower, making it a viable choice for embedding phishing detection algorithms in mobile phones.
Using pre-trained transformers to predict phishing websites from URLs alone has five advantages: 1) it requires little training time (230 to 320 s); 2) it is more easily updated than feature-based approaches because no pre-processing of URLs is required; 3) it is safer to use because phishing websites can be predicted without visiting the malicious sites; 4) it is easily deployed for real-time detection and can run on mobile devices; and 5) a mobile-specific transformer yields comparable performance and predicts 3 times faster than the other transformer models in this study.
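The abstract compares models by accuracy and other evaluation metrics computed on labeled URLs. As a rough illustration of how such metrics are derived for any URL-only classifier, the sketch below scores a classifier that maps a raw URL string to a label. It is not the authors' code: `evaluate`, `Metrics`, and `toy_predict` are hypothetical names, the keyword rule merely stands in for a fine-tuned transformer, and the URLs are made-up examples on reserved example domains.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Metrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def evaluate(predict: Callable[[str], int],
             urls: Sequence[str],
             labels: Sequence[int]) -> Metrics:
    """Score a URL classifier; label 1 = phishing, 0 = legitimate."""
    if len(urls) != len(labels):
        raise ValueError("urls and labels must have the same length")
    # The classifier sees only the raw URL string, no extracted features.
    preds = [predict(u) for u in urls]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return Metrics(accuracy, precision, recall, f1)


def toy_predict(url: str) -> int:
    # Hypothetical stand-in for a fine-tuned transformer: flags any URL
    # containing "login". Illustration only, not a real detector.
    return 1 if "login" in url else 0


# Made-up URLs on reserved example domains, for illustration only.
URLS = [
    "http://example.com/login-verify",
    "https://example.org/docs",
    "http://example.net/account/login",
    "https://example.com/login",
]
LABELS = [1, 0, 1, 0]

if __name__ == "__main__":
    print(evaluate(toy_predict, URLS, LABELS))
```

Because `evaluate` accepts any callable from URL string to label, the same harness could in principle wrap a fine-tuned BERT or MobileBERT classifier in place of `toy_predict`, which keeps the comparison between models on equal footing.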
