恶意软件分类的自然语言处理方法

IF 1.5 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Journal of Computer Virology and Hacking Techniques Pub Date : 2023-10-20 DOI:10.1007/s11416-023-00506-w

Ritik Mehta, Olha Jurečková, Mark Stamp

{"title":"恶意软件分类的自然语言处理方法","authors":"Ritik Mehta, Olha Jurečková, Mark Stamp","doi":"10.1007/s11416-023-00506-w","DOIUrl":null,"url":null,"abstract":"Many different machine learning and deep learning techniques have been successfully employed for malware detection and classification. Examples of popular learning techniques in the malware domain include Hidden Markov Models (HMM), Random Forests (RF), Convolutional Neural Networks (CNN), Support Vector Machines (SVM), and Recurrent Neural Networks (RNN) such as Long Short-Term Memory (LSTM) networks. In this research, we consider a hybrid architecture, where HMMs are trained on opcode sequences, and the resulting hidden states of these trained HMMs are used as feature vectors in various classifiers. In this context, extracting the HMM hidden state sequences can be viewed as a form of feature engineering that is somewhat analogous to techniques that are commonly employed in Natural Language Processing (NLP). We find that this NLP-based approach outperforms other popular techniques on a challenging malware dataset, with an HMM-Random Forest model yielding the best results.","PeriodicalId":15545,"journal":{"name":"Journal of Computer Virology and Hacking Techniques","volume":null,"pages":null},"PeriodicalIF":1.5000,"publicationDate":"2023-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A natural language processing approach to Malware classification\",\"authors\":\"Ritik Mehta, Olha Jurečková, Mark Stamp\",\"doi\":\"10.1007/s11416-023-00506-w\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Many different machine learning and deep learning techniques have been successfully employed for malware detection and classification. Examples of popular learning techniques in the malware domain include Hidden Markov Models (HMM), Random Forests (RF), Convolutional Neural Networks (CNN), Support Vector Machines (SVM), and Recurrent Neural Networks (RNN) such as Long Short-Term Memory (LSTM) networks. In this research, we consider a hybrid architecture, where HMMs are trained on opcode sequences, and the resulting hidden states of these trained HMMs are used as feature vectors in various classifiers. In this context, extracting the HMM hidden state sequences can be viewed as a form of feature engineering that is somewhat analogous to techniques that are commonly employed in Natural Language Processing (NLP). We find that this NLP-based approach outperforms other popular techniques on a challenging malware dataset, with an HMM-Random Forest model yielding the best results.\",\"PeriodicalId\":15545,\"journal\":{\"name\":\"Journal of Computer Virology and Hacking Techniques\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":1.5000,\"publicationDate\":\"2023-10-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Computer Virology and Hacking Techniques\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1007/s11416-023-00506-w\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Computer Virology and Hacking Techniques","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s11416-023-00506-w","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

摘要

许多不同的机器学习和深度学习技术已经成功地用于恶意软件检测和分类。恶意软件领域流行的学习技术的例子包括隐马尔可夫模型(HMM)、随机森林(RF)、卷积神经网络(CNN)、支持向量机(SVM)和循环神经网络(RNN)，如长短期记忆(LSTM)网络。在这项研究中，我们考虑了一种混合架构，其中hmm在操作码序列上进行训练，并将这些训练好的hmm的隐藏状态用作各种分类器的特征向量。在这种情况下，可以将HMM隐藏状态序列的提取视为特征工程的一种形式，这在某种程度上类似于自然语言处理(NLP)中常用的技术。我们发现这种基于nlp的方法在具有挑战性的恶意软件数据集上优于其他流行的技术，其中HMM-Random Forest模型产生了最好的结果。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

A natural language processing approach to Malware classification

Many different machine learning and deep learning techniques have been successfully employed for malware detection and classification. Examples of popular learning techniques in the malware domain include Hidden Markov Models (HMM), Random Forests (RF), Convolutional Neural Networks (CNN), Support Vector Machines (SVM), and Recurrent Neural Networks (RNN) such as Long Short-Term Memory (LSTM) networks. In this research, we consider a hybrid architecture, where HMMs are trained on opcode sequences, and the resulting hidden states of these trained HMMs are used as feature vectors in various classifiers. In this context, extracting the HMM hidden state sequences can be viewed as a form of feature engineering that is somewhat analogous to techniques that are commonly employed in Natural Language Processing (NLP). We find that this NLP-based approach outperforms other popular techniques on a challenging malware dataset, with an HMM-Random Forest model yielding the best results.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Journal of Computer Virology and Hacking Techniques COMPUTER SCIENCE, INFORMATION SYSTEMS-

CiteScore

4.00

自引率

13.30%

发文量

期刊介绍： The field of computer virus prevention has rapidly taken an important position in our technological and information society. Viral attacks increase year after year, and antiviral efforts continually face new challenges. Beneficial applications of technologies based on scientific computer virology are still very limited. The theoretical aspects of the virus problem are only rarely considered, although many interesting and important open problems still exist. Little proactive research is focused on predicting the future of viral attacks.The Journal of Computer Virology and Hacking Techniques is an independent scientific and technical journal dedicated to viral and antiviral computer technologies. Both theoretical and experimental aspects will be considered; papers emphasizing the theoretical aspects are especially welcome. The topics covered by this journal include, but are certainly not limited to:- Mathematical aspects and theoretical fundamentals of computer virology - Algorithmics and computer virology - Computer immunology and biological models for computers - Reverse engineering (hardware and software) - Viral and antiviral technologies - Cryptology and steganography tools and techniques - Applications in computer virology - Virology and IDS - Hardware hacking, and free and open hardware - Operating system, network, and embedded systems security - Social engineeringIn addition, since computational problems are of practical interest, papers on the computational aspects of computer virology are welcome. It is expected that the areas covered by this journal will change as new technologies, methodologies, challenges and applications develop. Hacking involves understanding technology intimately and in depth in order to use it in an operational way. Hackers are complementary to academics in that they favour the result over the methods and over the theory, while academics favour the formalization and the methods -- explaining is not operating and operating is not explaining. The aim of the journal in this respect is to build a bridge between the two communities for the benefit of technology and science.The aim of the Journal of Computer Virology and Hacking Techniques is to promote constructive research in computer virology by publishing technical and scientific results related to this research area. Submitted papers will be judged primarily by their content, their originality and their technical and scientific quality. Contributions should comprise novel and previously unpublished material.However, prior publication in conference proceedings of an abstract, summary, or other abbreviated, preliminary form of the material should not preclude publication in this journal when notice of such prior or concurrent publication is given with the submission. In addition to full-length theoretical and technical articles, short communications or notes are acceptable. Survey papers will be accepted with a prior invitation only. Special issues devoted to a single topic are also planned.The policy of the journal is to maintain strict refereeing procedures, to perform a high quality peer-review of each submitted paper, and to send notification to the author(s) with as short a delay as possible. Accepted papers will normally be published within one year of submission at the latest. The journal will be published four times a year. Note: As far as new viral techniques are concerned, the journal strongly encourages authors to consider algorithmic aspects rather than the actual source code of a particular virus. Nonetheless, papers containing viral source codes may be accepted provided that a scientific approach is maintained and that inclusion of the source code is necessary for the presentation of the research. No paper containing a viral source code will be considered or accepted unless the complete source code is communicated to the Editor-in-Chief. No publication will occur before antiviral companies receive this source code to update/upgrade their products.The final objective is, once again, proactive defence.This journal was previously known as Journal in Computer Virology. It is published by Springer France.