Feature Extraction Using Genetic Programming with Applications in Malware Detection
Cristina Vatamanu, Dragos Gavrilut, Razvan Benchea, H. Luchian
2015 17th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), pp. 224-231
Publication date: 2015-09-21
DOI: 10.1109/SYNASC.2015.43 (https://doi.org/10.1109/SYNASC.2015.43)
Citations: 5
Abstract
This paper extends the authors' previous research on a malware detection method, focusing on improving the accuracy of the perceptron-based One Side Class Perceptron algorithm through Genetic Programming. We seek a proper balance between the three basic requirements for malware detection algorithms: (a) that training time on large datasets stays below acceptable upper limits; (b) that the false positive rate (clean or legitimate files wrongly classified as malware) is as close as possible to 0; and (c) that the detection rate is as close as possible to 1. When the first two requirements are set as design objectives, the third is often missed: the detection rate is low. This study focuses on improving the detection rate while preserving the short training time and the low false positive rate. A further aim is to exploit the perceptron-based algorithm's good performance on linearly separable data by extracting new features from existing ones, so that the data becomes more separable. To keep the overall training time low, Genetic Programming is used to explore the huge search space of possible extracted features efficiently in terms of both time and memory footprint. For the experiments we used a dataset of 350,000 executable files, each described by an initial set of 300 Boolean features. The feature-extraction algorithm is implemented in parallel in order to cope with the size of the dataset. We also tested different ways of controlling the growth in size of the variable-length chromosomes. The experimental results show that the features produced by this method are better than the best ones obtained through mapping, allowing for an increase in detection rate.
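To make the idea of extracting features with Genetic Programming more concrete, here is a minimal sketch of a search over Boolean expressions built from existing Boolean features. It is an illustration under stated assumptions, not the authors' implementation: the tree encoding, the AND/OR/NOT operator set, the fitness (detection rate, set to zero on any false positive), the fixed size cap used to limit chromosome growth, and the mutation-only truncation selection are all assumptions made for this example, and every function name is hypothetical. The sketch is sequential, whereas the abstract states that the actual algorithm is parallel.

```python
import random

# Assumed encoding: a chromosome is a nested tuple, one of
# ("feat", i), ("not", a), ("and", a, b), ("or", a, b).

def random_tree(n_features, depth, rng):
    """Build a random Boolean expression over the original features."""
    if depth == 0 or rng.random() < 0.3:
        return ("feat", rng.randrange(n_features))
    op = rng.choice(["not", "and", "or"])
    if op == "not":
        return ("not", random_tree(n_features, depth - 1, rng))
    return (op,
            random_tree(n_features, depth - 1, rng),
            random_tree(n_features, depth - 1, rng))

def evaluate(tree, row):
    """Value of the extracted feature for one file's Boolean feature row."""
    kind = tree[0]
    if kind == "feat":
        return bool(row[tree[1]])
    if kind == "not":
        return not evaluate(tree[1], row)
    left, right = evaluate(tree[1], row), evaluate(tree[2], row)
    return (left and right) if kind == "and" else (left or right)

def size(tree):
    """Number of nodes in the chromosome."""
    if tree[0] == "feat":
        return 1
    return 1 + sum(size(child) for child in tree[1:])

def fitness(tree, rows, labels, max_size=15):
    """Assumed fitness: detection rate of the feature, or 0.0 if the tree
    exceeds the size cap or fires on any clean file (a false positive)."""
    if size(tree) > max_size:
        return 0.0
    hits = [evaluate(tree, row) for row in rows]
    if any(h and not y for h, y in zip(hits, labels)):
        return 0.0
    malware = sum(1 for y in labels if y)
    detected = sum(1 for h, y in zip(hits, labels) if h and y)
    return detected / malware if malware else 0.0

def mutate(tree, n_features, rng):
    """Replace a randomly chosen subtree with a fresh random one."""
    if tree[0] == "feat" or rng.random() < 0.3:
        return random_tree(n_features, 2, rng)
    idx = rng.randrange(1, len(tree))
    return tree[:idx] + (mutate(tree[idx], n_features, rng),) + tree[idx + 1:]

def evolve(rows, labels, n_features, pop_size=40, generations=30, seed=0):
    """Mutation-only search with truncation selection (a simplification)."""
    rng = random.Random(seed)
    score = lambda t: fitness(t, rows, labels)
    pop = [random_tree(n_features, 3, rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        parents = pop[: pop_size // 2]  # the better half survives unchanged
        children = [mutate(rng.choice(parents), n_features, rng)
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=score)
```

Calling `evolve(rows, labels, n_features)` with `rows` as sequences of Booleans and `labels` marking malware as `True` returns the best expression found; its truth value on each file can then be appended as an extra input feature for a linear classifier. The zero-false-positive penalty mirrors requirement (b) of the abstract only loosely, and a real system would also need crossover and a fitness tied to separability.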