Evolutionary feature selection for machine learning based malware classification

Engineering Science and Technology-An International Journal-Jestech; IF 5.1; Zone 2 (Engineering and Technology); JCR Q1 (Engineering, Multidisciplinary); Pub Date: 2024-07-20; DOI: 10.1016/j.jestch.2024.101762

Gülsade Kale, Gazi Erkan Bostancı, Fatih Vehbi Çelebi

Volume 56, Article 101762. Article page: https://www.sciencedirect.com/science/article/pii/S2215098624001484

Citations: 0

Abstract

Researching, analyzing, and detecting malware with well-chosen parameters is crucial for safeguarding a country’s security and economy. Increasingly sophisticated cyber-attacks directly affect individual welfare, social dynamics, and political stability. Because malware continuously evolves to evade detection, selecting effective and decisive parameters, with attention to the interactions among malware features, becomes even more essential. As malware adopts new technologies and techniques, signature-based detection systems are becoming inadequate. Rather than relying on these still widely used but insufficient systems, this study establishes a new system that focuses on malware behavior and on the relationships between the features that result from that behavior. Instead of a uniform approach, the system employs multi-objective genetic algorithms (MOGAs) to select critical and decisive features for malware detection. Machine learning (ML) algorithms within the implemented hybrid framework then use the selected features to detect and classify malware.
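The multi-objective selection idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the two objectives (classifier error and number of features kept), the operators, the survivor rule, and the `toy_error` function are all assumptions chosen for illustration, and a real run would replace `toy_error` with a cross-validated classifier error.

```python
import random
from typing import Callable, List, Tuple

Mask = Tuple[bool, ...]            # one bit per candidate feature
Objectives = Tuple[float, float]   # (error rate, features kept); both minimized
Scored = Tuple[Mask, Objectives]


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(scored: List[Scored]) -> List[Scored]:
    """Keep only the candidates that no other candidate dominates."""
    return [(m, o) for m, o in scored
            if not any(dominates(other, o) for _, other in scored)]


def crossover(p1: Mask, p2: Mask, rng: random.Random) -> Mask:
    """Uniform crossover: each bit is copied from one parent at random."""
    return tuple(a if rng.random() < 0.5 else b for a, b in zip(p1, p2))


def mutate(mask: Mask, rate: float, rng: random.Random) -> Mask:
    """Flip each bit independently with probability `rate`."""
    return tuple((not bit) if rng.random() < rate else bit for bit in mask)


def evolve(n_features: int, error_of: Callable[[Mask], float],
           generations: int = 30, pop_size: int = 20, seed: int = 0) -> List[Scored]:
    """Return the non-dominated feature masks found after a fixed number of generations."""
    rng = random.Random(seed)
    pop = [tuple(rng.random() < 0.5 for _ in range(n_features))
           for _ in range(pop_size)]
    for _ in range(generations):
        children = [mutate(crossover(rng.choice(pop), rng.choice(pop), rng),
                           1.0 / n_features, rng)
                    for _ in range(pop_size)]
        unique = list(dict.fromkeys(pop + children))   # de-duplicate, keep order
        scored = [(m, (error_of(m), float(sum(m)))) for m in unique]
        front = pareto_front(scored)
        # Assumed survivor rule: non-dominated masks first, then lowest error.
        rest = sorted((s for s in scored if s not in front), key=lambda s: s[1])
        pop = [m for m, _ in (front + rest)[:pop_size]]
    final = [(m, (error_of(m), float(sum(m)))) for m in dict.fromkeys(pop)]
    return pareto_front(final)


def toy_error(mask: Mask) -> float:
    """Hypothetical stand-in for classifier error: only features 0-2 help."""
    return 0.5 - 0.15 * sum(mask[:3])
```

Calling `evolve(8, toy_error)` returns a list of `(mask, (error, features_kept))` pairs in which no entry is better than another on both objectives, which is the trade-off a multi-objective search exposes and a single accuracy score hides.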

This paper aims to identify the feature selection and classification methods that yield the highest accuracy within the Cuckoo Sandbox environment. Specifically, the J48 Decision Tree (J48), Reduced Error Pruning Tree (REP Tree), Adaptive Boosting Model 1 (AdaboostM1), Multilayer Perceptron (MLP), and Naive Bayes (NB) classifiers were assessed. Through our analysis, the feature set was reduced from 335 to 200 features, taking inter-feature relationships into account, resulting in a peak accuracy of 93.33% and a corresponding 40% performance enhancement due to the reduction in the number of features. The obtained metrics were compared and evaluated with respect to the employed algorithms and methodologies. Additionally, McNemar’s test was used to compare the malware detection classifiers on their correct and incorrect classifications. Notably, McNemar’s test revealed significant improvements.
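For readers unfamiliar with McNemar’s test, the following sketch shows how two classifiers can be compared on the same test samples. The abstract does not say which variant the authors used, so both the continuity-corrected chi-square form and the exact binomial form are shown as assumptions; the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def discordant_counts(y_true: Sequence, pred_a: Sequence,
                      pred_b: Sequence) -> Tuple[int, int]:
    """Count samples where exactly one of the two classifiers is correct."""
    b = sum(1 for t, a, c in zip(y_true, pred_a, pred_b) if a == t and c != t)
    c = sum(1 for t, a, c in zip(y_true, pred_a, pred_b) if a != t and c == t)
    return b, c   # b: only A correct, c: only B correct


def mcnemar_chi2(b: int, c: int) -> Tuple[float, float]:
    """Continuity-corrected McNemar statistic and its p-value (1 degree of freedom)."""
    if b + c == 0:
        return 0.0, 1.0
    stat = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of chi-square with 1 dof: erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact binomial p-value, preferable when b + c is small."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Only the discordant pairs enter the statistic: samples that both classifiers get right, or both get wrong, carry no information about which classifier is better. A small p-value suggests the two error rates differ; it does not by itself say which classifier to prefer, so the counts `b` and `c` should be reported alongside it.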


Source journal

Engineering Science and Technology-An International Journal-Jestech (Materials Science: Electronic, Optical and Magnetic Materials)

CiteScore: 11.20

Self-citation rate: 3.50%

Articles published: 153

Review time: 22 days

About the journal: Engineering Science and Technology, an International Journal (JESTECH) (formerly Technology), a peer-reviewed quarterly engineering journal, publishes both theoretical and experimental high-quality papers of permanent interest, not previously published in journals, in the field of engineering and applied science, with the aim of promoting the theory and practice of technology and engineering. In addition to peer-reviewed original research papers, the Editorial Board welcomes original research reports, state-of-the-art reviews and communications in the broadly defined field of engineering science and technology. The scope of JESTECH covers a wide spectrum of subjects, including:
-Electrical/Electronics and Computer Engineering (Biomedical Engineering and Instrumentation; Coding, Cryptography, and Information Protection; Communications, Networks, Mobile Computing and Distributed Systems; Compilers and Operating Systems; Computer Architecture, Parallel Processing, and Dependability; Computer Vision and Robotics; Control Theory; Electromagnetic Waves, Microwave Techniques and Antennas; Embedded Systems; Integrated Circuits, VLSI Design, Testing, and CAD; Microelectromechanical Systems; Microelectronics, and Electronic Devices and Circuits; Power, Energy and Energy Conversion Systems; Signal, Image, and Speech Processing)
-Mechanical and Civil Engineering (Automotive Technologies; Biomechanics; Construction Materials; Design and Manufacturing; Dynamics and Control; Energy Generation, Utilization, Conversion, and Storage; Fluid Mechanics and Hydraulics; Heat and Mass Transfer; Micro-Nano Sciences; Renewable and Sustainable Energy Technologies; Robotics and Mechatronics; Solid Mechanics and Structure; Thermal Sciences)
-Metallurgical and Materials Engineering (Advanced Materials Science; Biomaterials; Ceramic and Inorganic Materials; Electronic-Magnetic Materials; Energy and Environment; Materials Characterization; Metallurgy; Polymers and Nanocomposites)