A MACHINE LEARNING CLASSIFICATION APPROACH TO DETECT TLS-BASED MALWARE USING ENTROPY-BASED FLOW SET FEATURES

International Journal of Information and Communication Technology (Q4, Computer Science); Pub Date: 2022-07-17; DOI: 10.32890/jict2022.21.3.1

Kinan Keshkeh, A. Jantan, Kamal Alieyan

Abstract

Transport Layer Security (TLS) based malware is one of the most hazardous malware types, as it relies on encryption to conceal its connections. Because decrypting TLS traffic is complex, several anomaly-based detection studies have sought to detect TLS-based malware using different features and machine learning (ML) algorithms. However, most of these studies used flow features with no feature transformation or relied on inefficient flow feature transformations such as frequency-based periodicity analysis and outlier percentage. This paper introduces TLSMalDetect, a TLS-based malware detection approach that integrates periodicity-independent entropy-based flow set (EFS) features, generated by a flow feature transformation technique, to address the flow feature utilization issues in related research. The effectiveness of the EFS features was evaluated in two ways: (1) by comparing them with the corresponding outlier percentage and flow features using four feature importance methods, and (2) by analyzing classification performance with and without EFS features. Moreover, new Transmission Control Protocol features not explored in the literature were incorporated into TLSMalDetect, and their contribution was assessed. The results showed that the EFS features for the number of packets sent and received were superior to the related outlier percentage and flow features and could increase performance by up to ~42% in the case of Support Vector Machine accuracy. Furthermore, using the basic features, TLSMalDetect achieved its highest accuracy, 93.69%, with Naïve Bayes (NB) among the ML algorithms applied. In comparison with previous studies, TLSMalDetect's Random Forest precision of 98.99% and NB recall of 92.91% exceeded the best relevant findings. These comparative results demonstrate that TLSMalDetect detects a larger share of all malicious flows (higher recall) than existing works, and that a larger share of the alerts it raises are true alerts (higher precision) than in earlier research.
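To make the idea of an entropy-based flow set feature concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not give the exact EFS definition, so the sketch assumes that a flow set is a group of flows sharing some key (for example, the same host pair) and that the feature is the Shannon entropy of the empirical distribution of one per-flow value, here the number of packets sent. The function name `shannon_entropy` and both sample lists are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    if not values:
        return 0.0
    counts = Counter(values)
    total = len(values)
    # Sum p * log2(1/p) over each distinct value, where p = count / total.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical flow sets: packets sent in each flow between one host pair.
# These numbers are invented for illustration and are not from the paper.
uniform_flow_set = [12, 12, 12, 12, 12, 12]  # every flow looks the same
varied_flow_set = [3, 41, 7, 120, 18, 64]    # every flow differs

uniform_entropy = shannon_entropy(uniform_flow_set)
varied_entropy = shannon_entropy(varied_flow_set)

print(f"uniform flow set entropy: {uniform_entropy:.3f} bits")
print(f"varied flow set entropy:  {varied_entropy:.3f} bits")
```

Under these assumptions, a flow set whose flows all carry the same packet count has zero entropy, while a set of six distinct counts reaches the maximum of log2(6) bits. Because the value depends only on how often each count occurs, not on when the flows occur, it is one way a feature can be independent of periodicity.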

Source journal

International Journal of Information and Communication Technology (Computer Science, Information Systems)

CiteScore: 0.70

Self-citation rate: 0.00%

Journal description: IJICT is a refereed journal in the field of information and communication technology (ICT), providing an international forum for professionals, engineers and researchers. IJICT reports the new paradigms in this emerging field of technology and envisions future developments in the frontier areas. The journal addresses issues for the vertical and horizontal applications in this area. Topics covered include: information theory/coding; information/IT/network security, standards, applications; Internet/web-based systems/products; data mining/warehousing; network planning, design, administration; sensor/ad hoc networks; human-computer intelligent interaction, AI; computational linguistics, digital speech; distributed/cooperative media; interactive communication media/content; social interaction, mobile communications; signal representation/processing, image processing; virtual reality, cyber law, e-governance; microprocessor interfacing, hardware design; control of industrial processes, ERP/CRM/SCM.