A natural language processing approach to Malware classification
Authors: Ritik Mehta, Olha Jurečková, Mark Stamp
Journal: Journal of Computer Virology and Hacking Techniques (Impact Factor 1.5; JCR Q3, Computer Science, Information Systems)
Published: 2023-10-20
DOI: 10.1007/s11416-023-00506-w (https://doi.org/10.1007/s11416-023-00506-w)
Many different machine learning and deep learning techniques have been successfully employed for malware detection and classification. Examples of popular learning techniques in the malware domain include Hidden Markov Models (HMM), Random Forests (RF), Convolutional Neural Networks (CNN), Support Vector Machines (SVM), and Recurrent Neural Networks (RNN) such as Long Short-Term Memory (LSTM) networks. In this research, we consider a hybrid architecture, where HMMs are trained on opcode sequences, and the resulting hidden states of these trained HMMs are used as feature vectors in various classifiers. In this context, extracting the HMM hidden state sequences can be viewed as a form of feature engineering that is somewhat analogous to techniques that are commonly employed in Natural Language Processing (NLP). We find that this NLP-based approach outperforms other popular techniques on a challenging malware dataset, with an HMM-Random Forest model yielding the best results.
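
The hybrid idea in the abstract (decode an opcode sequence into HMM hidden states, then use those states as a feature vector) can be illustrated with a minimal sketch. This is not the authors' implementation. The opcode vocabulary, the two-state HMM parameters, the padding value `-1`, and the helper names `viterbi` and `hidden_state_features` are all assumptions made for illustration. In particular, the HMM here is hand-set rather than trained, and Viterbi decoding is one plausible way to obtain a hidden state sequence; the abstract does not say which decoding method the paper uses.

```python
import math

# Hypothetical opcode vocabulary and hand-set 2-state HMM parameters.
# In the paper the HMMs are trained on opcode sequences; fixed values
# are used here only to keep the sketch self-contained.
OPCODES = {"mov": 0, "push": 1, "call": 2, "jmp": 3}
START_P = [0.6, 0.4]
TRANS_P = [[0.7, 0.3], [0.4, 0.6]]
EMIT_P = [[0.5, 0.3, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]]


def _log(p):
    # Log-probabilities avoid underflow on long sequences.
    return math.log(p) if p > 0 else float("-inf")


def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely hidden state sequence for a non-empty list of symbol indices."""
    n_states = len(start_p)
    score = [_log(start_p[s]) + _log(emit_p[s][obs[0]]) for s in range(n_states)]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for o in obs[1:]:
        prev, score, ptr = score, [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda r: prev[r] + _log(trans_p[r][s]))
            ptr.append(best)
            score.append(prev[best] + _log(trans_p[best][s]) + _log(emit_p[s][o]))
        back.append(ptr)
    # Backtrack from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


def hidden_state_features(opcodes, length):
    """Fixed-length feature vector: decoded states, truncated or padded with -1."""
    obs = [OPCODES[op] for op in opcodes]  # assumes every opcode is in the vocabulary
    states = viterbi(obs, START_P, TRANS_P, EMIT_P)[:length]
    return states + [-1] * (length - len(states))


features = hidden_state_features(["push", "mov", "call", "jmp", "mov"], length=8)
print(features)
```

Vectors produced this way have a fixed length, so they could be passed to any standard classifier, such as the Random Forest that the abstract reports as giving the best results.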
About the journal:
The field of computer virus prevention has rapidly taken an important position in our technological and information society. Viral attacks increase year after year, and antiviral efforts continually face new challenges. Beneficial applications of technologies based on scientific computer virology are still very limited. The theoretical aspects of the virus problem are only rarely considered, although many interesting and important open problems still exist. Little proactive research is focused on predicting the future of viral attacks.
The Journal of Computer Virology and Hacking Techniques is an independent scientific and technical journal dedicated to viral and antiviral computer technologies. Both theoretical and experimental aspects will be considered; papers emphasizing the theoretical aspects are especially welcome. The topics covered by this journal include, but are certainly not limited to:
- Mathematical aspects and theoretical fundamentals of computer virology
- Algorithmics and computer virology
- Computer immunology and biological models for computers
- Reverse engineering (hardware and software)
- Viral and antiviral technologies
- Cryptology and steganography tools and techniques
- Applications in computer virology
- Virology and IDS
- Hardware hacking, and free and open hardware
- Operating system, network, and embedded systems security
- Social engineering
In addition, since computational problems are of practical interest, papers on the computational aspects of computer virology are welcome. It is expected that the areas covered by this journal will change as new technologies, methodologies, challenges and applications develop. Hacking involves understanding technology intimately and in depth in order to use it in an operational way.
Hackers are complementary to academics in that they favour the result over the methods and over the theory, while academics favour the formalization and the methods; explaining is not operating and operating is not explaining. The aim of the journal in this respect is to build a bridge between the two communities for the benefit of technology and science.
The aim of the Journal of Computer Virology and Hacking Techniques is to promote constructive research in computer virology by publishing technical and scientific results related to this research area. Submitted papers will be judged primarily by their content, their originality and their technical and scientific quality. Contributions should comprise novel and previously unpublished material. However, prior publication in conference proceedings of an abstract, summary, or other abbreviated, preliminary form of the material should not preclude publication in this journal when notice of such prior or concurrent publication is given with the submission. In addition to full-length theoretical and technical articles, short communications or notes are acceptable. Survey papers will be accepted by prior invitation only. Special issues devoted to a single topic are also planned.
The policy of the journal is to maintain strict refereeing procedures, to perform a high-quality peer review of each submitted paper, and to send notification to the author(s) with as short a delay as possible. Accepted papers will normally be published within one year of submission at the latest. The journal will be published four times a year.
Note: As far as new viral techniques are concerned, the journal strongly encourages authors to consider algorithmic aspects rather than the actual source code of a particular virus. Nonetheless, papers containing viral source code may be accepted provided that a scientific approach is maintained and that inclusion of the source code is necessary for the presentation of the research. No paper containing viral source code will be considered or accepted unless the complete source code is communicated to the Editor-in-Chief. No publication will occur before antiviral companies receive this source code to update or upgrade their products. The final objective is, once again, proactive defence.
This journal was previously known as Journal in Computer Virology. It is published by Springer France.