Feature Extraction Using Genetic Programming with Applications in Malware Detection

2015 17th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC). Pub Date: 2015-09-21. DOI: 10.1109/SYNASC.2015.43

Cristina Vatamanu, Dragos Gavrilut, Razvan Benchea, H. Luchian

Pages: 224-231

Citations: 5

Abstract

This paper extends the authors' previous research on a malware detection method, focusing on improving the accuracy of the perceptron-based One Side Class Perceptron algorithm through the use of Genetic Programming. We are concerned with finding a proper balance between the three basic requirements for malware detection algorithms: (a) that their training time on large datasets falls below acceptable upper limits; (b) that their false positive rate (clean/legitimate files/software wrongly classified as malware) is as close as possible to 0; and (c) that their detection rate is as close as possible to 1. When the first two requirements are set as objectives for the design of detection algorithms, the third objective is often missed: the detection rate is low. This study focuses on improving the detection rate while preserving the short training time and the low false positive rate. Another concern is to exploit the perceptron-based algorithm's good performance on linearly separable data by extracting new features from existing ones. To keep the overall training time low, the huge search space of possible extracted features is explored with Genetic Programming, efficiently in terms of both time and memory footprint, with the aim of better separability. For the experiments we used a dataset of 350,000 executable files, each described by an initial set of 300 Boolean features. The feature-extraction algorithm is implemented in parallel in order to cope with the size of the dataset. We also tested different ways of controlling the growth in size of the variable-length chromosomes. The experimental results show that the features produced by this method are better than the best ones obtained through mapping, allowing for an increase in detection rate.
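To make the idea concrete, the following is a minimal sketch of evolving one derived Boolean feature with Genetic Programming. It is not the authors' implementation: the abstract does not specify the operator set, the fitness function, or the selection scheme, so the `and`/`or`/`xor`/`not` operators, the penalty weights `fp_penalty` and `size_penalty`, the mutation-only search (no crossover), and all function names here are assumptions chosen for illustration. The depth cap and the size penalty stand in for the chromosome growth control mentioned above, and the toy dataset is made up.

```python
import random

# A derived feature is an expression tree over the original Boolean features:
#   ("var", i) | ("not", t) | ("and", a, b) | ("or", a, b) | ("xor", a, b)
# The operator set is an assumption; the abstract does not list one.
BINARY_OPS = {
    "and": lambda a, b: a and b,
    "or": lambda a, b: a or b,
    "xor": lambda a, b: a != b,
}

def random_tree(n_features, max_depth, rng):
    """Grow a random tree; max_depth caps its size (a simple growth control)."""
    if max_depth == 0 or rng.random() < 0.3:
        return ("var", rng.randrange(n_features))
    if rng.random() < 0.2:
        return ("not", random_tree(n_features, max_depth - 1, rng))
    op = rng.choice(sorted(BINARY_OPS))
    return (op,
            random_tree(n_features, max_depth - 1, rng),
            random_tree(n_features, max_depth - 1, rng))

def evaluate(tree, row):
    """Compute the derived feature's value for one sample."""
    kind = tree[0]
    if kind == "var":
        return bool(row[tree[1]])
    if kind == "not":
        return not evaluate(tree[1], row)
    return BINARY_OPS[kind](evaluate(tree[1], row), evaluate(tree[2], row))

def size(tree):
    """Number of nodes in the tree."""
    if tree[0] == "var":
        return 1
    return 1 + sum(size(child) for child in tree[1:])

def rates(tree, rows, labels):
    """(detection rate, false positive rate) if this feature alone flagged malware."""
    tp = fp = pos = neg = 0
    for row, is_malware in zip(rows, labels):
        fired = evaluate(tree, row)
        if is_malware:
            pos += 1
            tp += fired
        else:
            neg += 1
            fp += fired
    return tp / max(pos, 1), fp / max(neg, 1)

def fitness(tree, rows, labels, fp_penalty=10.0, size_penalty=0.001):
    """Hypothetical fitness: reward detection, punish false positives and size."""
    dr, fpr = rates(tree, rows, labels)
    return dr - fp_penalty * fpr - size_penalty * size(tree)

def mutate(tree, n_features, max_depth, rng):
    """Replace a randomly chosen subtree with a fresh one within the depth budget."""
    if tree[0] == "var" or max_depth == 0 or rng.random() < 0.3:
        return random_tree(n_features, max_depth, rng)
    idx = rng.randrange(1, len(tree))
    children = list(tree)
    children[idx] = mutate(tree[idx], n_features, max_depth - 1, rng)
    return tuple(children)

def evolve(rows, labels, generations=30, pop_size=40, max_depth=4, seed=0):
    """Mutation-only search with truncation selection (a simplification of GP)."""
    rng = random.Random(seed)
    n = len(rows[0])
    pop = [random_tree(n, max_depth, rng) for _ in range(pop_size)]

    def score(tree):
        return fitness(tree, rows, labels)

    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        survivors = pop[: pop_size // 2]  # keep the better half unchanged
        pop = survivors + [mutate(rng.choice(survivors), n, max_depth, rng)
                           for _ in survivors]
    return max(pop, key=score)

# Toy data (made up): 4 Boolean features, "malware" iff feature 0 XOR feature 1,
# which no single original feature separates.
rows = [[(i >> b) & 1 for b in range(4)] for i in range(16)]
labels = [row[0] != row[1] for row in rows]
best = evolve(rows, labels)
print(best, rates(best, rows, labels))
```

In this sketch a heavy false positive penalty plays the role of requirement (b), and a derived feature that scores well could be appended to the original feature set before training a linear classifier. At the scale reported in the abstract, fitness evaluation over the dataset would dominate the cost, which is presumably why the authors parallelize it.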
