Feature Extraction Using Genetic Programming with Applications in Malware Detection

Cristina Vatamanu, Dragos Gavrilut, Razvan Benchea, H. Luchian
Published in: 2015 17th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), pp. 224-231
Publication date: 2015-09-21
DOI: 10.1109/SYNASC.2015.43 (https://doi.org/10.1109/SYNASC.2015.43)
Citations: 5

Abstract

This paper extends the authors' previous research on a malware detection method, focusing on improving the accuracy of the perceptron-based One Side Class Perceptron algorithm through Genetic Programming. We seek a proper balance between the three basic requirements for malware detection algorithms: (a) training time on large datasets stays below acceptable upper limits; (b) the false positive rate (clean or legitimate files wrongly classified as malware) is as close as possible to 0; and (c) the detection rate is as close as possible to 1. When the first two requirements are set as design objectives, the third is often missed: the detection rate is low. This study focuses on improving the detection rate while preserving the short training time and the low false positive rate. A further aim is to exploit the perceptron-based algorithm's good performance on linearly separable data by extracting new features from existing ones. To keep the overall training time low, Genetic Programming is used to explore the huge search space of possible extracted features efficiently in terms of time and memory footprint, seeking better separability. For the experiments we used a dataset of 350,000 executable files, each described by an initial set of 300 Boolean features. The feature-extraction algorithm is implemented in parallel to cope with the size of the dataset. We also tested different ways of controlling the growth in size of the variable-length chromosomes. The experimental results show that the features produced by this method are better than the best ones obtained through mapping, allowing an increase in detection rate.
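The abstract does not specify the chromosome encoding, genetic operators, or fitness function, so the following is only a minimal sketch of the general idea: evolving Boolean expression trees over existing features so that the derived feature fires on malware but not on clean files. Everything here is an assumption made for illustration, including the operator set (`and`, `or`, `not`), the fitness weights `fp_penalty` and `size_penalty`, the hard size cap `max_size` used to limit chromosome growth, and all function names. It is not the authors' implementation and it is sequential rather than parallel.

```python
import random

# A tree is either an int (index of an original Boolean feature)
# or a tuple: ("not", a), ("and", a, b), ("or", a, b).

def evaluate(tree, sample):
    """Compute the value of a derived feature on one sample."""
    if isinstance(tree, int):
        return bool(sample[tree])
    if tree[0] == "not":
        return not evaluate(tree[1], sample)
    left, right = evaluate(tree[1], sample), evaluate(tree[2], sample)
    return (left and right) if tree[0] == "and" else (left or right)

def size(tree):
    """Number of nodes; used to control chromosome growth."""
    if isinstance(tree, int):
        return 1
    return 1 + sum(size(child) for child in tree[1:])

def random_tree(n_features, depth, rng):
    if depth == 0 or rng.random() < 0.3:
        return rng.randrange(n_features)
    op = rng.choice(["not", "and", "or"])
    if op == "not":
        return ("not", random_tree(n_features, depth - 1, rng))
    return (op,
            random_tree(n_features, depth - 1, rng),
            random_tree(n_features, depth - 1, rng))

def fitness(tree, samples, labels, fp_penalty=10, size_penalty=0.01):
    """Assumed fitness: reward malware hits, punish clean hits and size."""
    hits = sum(1 for s, y in zip(samples, labels) if y and evaluate(tree, s))
    false_pos = sum(1 for s, y in zip(samples, labels)
                    if not y and evaluate(tree, s))
    return hits - fp_penalty * false_pos - size_penalty * size(tree)

def mutate(tree, n_features, rng):
    """Replace a randomly chosen subtree with a fresh small tree."""
    if isinstance(tree, int) or rng.random() < 0.3:
        return random_tree(n_features, 2, rng)
    i = rng.randrange(1, len(tree))
    return tree[:i] + (mutate(tree[i], n_features, rng),) + tree[i + 1:]

def evolve(samples, labels, n_features, generations=30,
           pop_size=40, max_size=15, seed=0):
    rng = random.Random(seed)
    pop = [random_tree(n_features, 3, rng) for _ in range(pop_size)]

    def score(tree):
        return fitness(tree, samples, labels)

    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        survivors = pop[: pop_size // 2]  # truncation selection with elitism
        children = []
        while len(survivors) + len(children) < pop_size:
            child = mutate(rng.choice(survivors), n_features, rng)
            if size(child) <= max_size:   # hard cap on chromosome size
                children.append(child)
        pop = survivors + children
    return max(pop, key=score)

if __name__ == "__main__":
    # Toy data: a file is "malware" only when features 0 and 1 are both set,
    # so no single original feature separates the two classes.
    samples = [[bool(i & 1), bool(i & 2), bool(i & 4)] for i in range(8)]
    labels = [s[0] and s[1] for s in samples]
    best = evolve(samples, labels, n_features=3)
    print(best, fitness(best, samples, labels))
```

The size cap is just one way to limit growth of variable-length chromosomes; the parsimony term in `fitness` is another, and the abstract does not say which strategies the authors compared. Features found this way would be appended to the original feature vector before training the linear classifier.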