基于字节的深度神经网络恶意软件分类激活分析

2019 IEEE Security and Privacy Workshops (SPW) Pub Date : 2019-03-12 DOI:10.1109/SPW.2019.00017

Scott E. Coull, Christopher Gardner

{"title":"基于字节的深度神经网络恶意软件分类激活分析","authors":"Scott E. Coull, Christopher Gardner","doi":"10.1109/SPW.2019.00017","DOIUrl":null,"url":null,"abstract":"Feature engineering is one of the most costly aspects of developing effective machine learning models, and that cost is even greater in specialized problem domains, like malware classification, where expert skills are necessary to identify useful features. Recent work, however, has shown that deep learning models can be used to automatically learn feature representations directly from the raw, unstructured bytes of the binaries themselves. In this paper, we explore what these models are learning about malware. To do so, we examine the learned features at multiple levels of resolution, from individual byte embeddings to end-to-end analysis of the model. At each step, we connect these byte-oriented activations to their original semantics through parsing and disassembly of the binary to arrive at human-understandable features. Through our results, we identify several interesting features learned by the model and their connection to manually-derived features typically used by traditional machine learning models. Additionally, we explore the impact of training data volume and regularization on the quality of the learned features and the efficacy of the classifiers, revealing the somewhat paradoxical insight that better generalization does not necessarily result in better performance for byte-based malware classifiers.","PeriodicalId":125351,"journal":{"name":"2019 IEEE Security and Privacy Workshops (SPW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"37","resultStr":"{\"title\":\"Activation Analysis of a Byte-Based Deep Neural Network for Malware Classification\",\"authors\":\"Scott E. Coull, Christopher Gardner\",\"doi\":\"10.1109/SPW.2019.00017\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Feature engineering is one of the most costly aspects of developing effective machine learning models, and that cost is even greater in specialized problem domains, like malware classification, where expert skills are necessary to identify useful features. Recent work, however, has shown that deep learning models can be used to automatically learn feature representations directly from the raw, unstructured bytes of the binaries themselves. In this paper, we explore what these models are learning about malware. To do so, we examine the learned features at multiple levels of resolution, from individual byte embeddings to end-to-end analysis of the model. At each step, we connect these byte-oriented activations to their original semantics through parsing and disassembly of the binary to arrive at human-understandable features. Through our results, we identify several interesting features learned by the model and their connection to manually-derived features typically used by traditional machine learning models. Additionally, we explore the impact of training data volume and regularization on the quality of the learned features and the efficacy of the classifiers, revealing the somewhat paradoxical insight that better generalization does not necessarily result in better performance for byte-based malware classifiers.\",\"PeriodicalId\":125351,\"journal\":{\"name\":\"2019 IEEE Security and Privacy Workshops (SPW)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-03-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"37\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 IEEE Security and Privacy Workshops (SPW)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SPW.2019.00017\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE Security and Privacy Workshops (SPW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SPW.2019.00017","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 37

摘要

特征工程是开发有效的机器学习模型最昂贵的方面之一，在专门的问题领域，比如恶意软件分类，这种成本甚至更大，在这些领域，识别有用的特征需要专家技能。然而，最近的研究表明，深度学习模型可以直接从二进制文件本身的原始、非结构化字节中自动学习特征表示。在本文中，我们探讨了这些模型对恶意软件的了解。为此，我们在多个分辨率级别上检查学习到的特征，从单个字节嵌入到模型的端到端分析。在每一步中，我们通过解析和反汇编二进制文件，将这些面向字节的激活连接到它们的原始语义，以获得人类可以理解的特征。通过我们的结果，我们确定了模型学习到的几个有趣的特征，以及它们与传统机器学习模型通常使用的手动衍生特征的联系。此外，我们探讨了训练数据量和正则化对学习特征质量和分类器效率的影响，揭示了更好的泛化并不一定会给基于字节的恶意软件分类器带来更好的性能这一有点矛盾的见解。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Activation Analysis of a Byte-Based Deep Neural Network for Malware Classification

Feature engineering is one of the most costly aspects of developing effective machine learning models, and that cost is even greater in specialized problem domains, like malware classification, where expert skills are necessary to identify useful features. Recent work, however, has shown that deep learning models can be used to automatically learn feature representations directly from the raw, unstructured bytes of the binaries themselves. In this paper, we explore what these models are learning about malware. To do so, we examine the learned features at multiple levels of resolution, from individual byte embeddings to end-to-end analysis of the model. At each step, we connect these byte-oriented activations to their original semantics through parsing and disassembly of the binary to arrive at human-understandable features. Through our results, we identify several interesting features learned by the model and their connection to manually-derived features typically used by traditional machine learning models. Additionally, we explore the impact of training data volume and regularization on the quality of the learned features and the efficacy of the classifiers, revealing the somewhat paradoxical insight that better generalization does not necessarily result in better performance for byte-based malware classifiers.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2019 IEEE Security and Privacy Workshops (SPW)

自引率

0.00%

发文量