PHP-based malicious webshell detection based on abstract syntax tree simplification and explicit duration recurrent networks
Computers & Security, Journal Article, published 2024-08-10. Impact factor 4.8 (JCR Q1, Computer Science, Information Systems). DOI: 10.1016/j.cose.2024.104049. URL: https://www.sciencedirect.com/science/article/pii/S0167404824003547
Malicious webshells are the most common attack scripts used by attackers in web penetration. Attackers typically obfuscate strings of PHP-based malicious webshells and encrypt communication traffic to bypass security devices. In this case, the opcode sequences of the PHP-based malicious webshells become excessively long and contain many irrelevant features, which affect the efficacy of the detection method. This study proposes a new PHP-based malicious webshell detection method. The proposed method introduces three simplification strategies for the three main types of nodes in the abstract syntax trees of PHP scripts to reduce the length and noise of opcode sequences of PHP-based malicious webshells. An explicit duration recurrent network (EDRN), a recurrent neural network based on an extended hidden semi-Markov model, is used to detect malicious webshells. Word2vec is adopted to convert the opcode sequences of the PHP scripts into vectors that serve as the input for the EDRN. Experiments were conducted using public datasets collected from GitHub. The experimental results indicated that EDRN outperformed popular recurrent neural networks. The proposed method demonstrated superior performance compared with several state-of-the-art approaches and mainstream tools, achieving an accuracy of 0.993, an F1 score of 0.990, and a recall rate of 0.991. When only 20% of the datasets were used for training, the proposed method achieved accuracy, recall, and F1 scores of 0.985, 0.983, and 0.980, respectively, significantly outperforming existing approaches.
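The accuracy, recall, and F1 figures quoted above are standard binary-classification metrics. The following is a minimal sketch of how such figures are derived from confusion counts, assuming "webshell" is treated as the positive class (the abstract does not say which class is positive). The `ConfusionCounts` type and the counts used are hypothetical illustrations, not data or code from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion counts; 'webshell' is ASSUMED to be the positive class."""
    tp: int  # webshells correctly flagged
    fp: int  # benign scripts wrongly flagged
    tn: int  # benign scripts correctly passed
    fn: int  # webshells missed


def accuracy(c: ConfusionCounts) -> float:
    """Fraction of all scripts classified correctly."""
    total = c.tp + c.fp + c.tn + c.fn
    return (c.tp + c.tn) / total if total else 0.0


def precision(c: ConfusionCounts) -> float:
    """Fraction of flagged scripts that really are webshells."""
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def recall(c: ConfusionCounts) -> float:
    """Fraction of real webshells that were flagged."""
    positives = c.tp + c.fn
    return c.tp / positives if positives else 0.0


def f1_score(c: ConfusionCounts) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts for illustration only (not taken from the paper).
counts = ConfusionCounts(tp=90, fp=5, tn=100, fn=5)

if __name__ == "__main__":
    print(f"accuracy={accuracy(counts):.3f}")
    print(f"recall={recall(counts):.3f}")
    print(f"f1={f1_score(counts):.3f}")
```

Because F1 combines precision and recall, a detector can report high accuracy on an imbalanced dataset while still having a noticeably lower F1, which is one reason all three numbers are usually reported together.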
About the journal:
Computers & Security is the most respected technical journal in the IT security field. With its high-profile editorial board and informative regular features and columns, the journal is essential reading for IT security professionals around the world.
Computers & Security provides you with a unique blend of leading-edge research and sound practical management advice. It is aimed at professionals involved with computer security, audit, control and data integrity in all sectors: industry, commerce and academia. Recognized worldwide as the primary source of reference for applied research and technical expertise, it is your first step to fully secure systems.