Analysing corpus of office documents for macro-based attacks using Machine Learning

Global Transitions Proceedings Pub Date : 2022-06-01 DOI:10.1016/j.gltp.2022.04.004

V Ravi, S.P. Gururaj, H.K. Vedamurthy, M.B. Nirmala

Volume 3, Issue 1, Pages 20-24. Publication type: Journal Article. Article page: https://www.sciencedirect.com/science/article/pii/S2666285X22000401

Citations: 4

Abstract

Macro-based malware is increasingly common in recent cyber-attacks. These attacks use malicious code written in Visual Basic, which can target computers to carry out various exploits. Macro malware can be obfuscated with various tools and can easily evade antivirus software. Several machine learning methods have been proposed to detect macro malware, but they rely on inadequate datasets of benign and malicious macro code, are not reproducible, and are evaluated on unbalanced datasets. This paper proposes using a word embedding technique, Word2Vec, for code analysis: macro code written in Visual Basic is analyzed and processed to understand and detect the attack vector before the document is opened. The proposed technique, called Obfuscated-Word2vec, detects obfuscated keywords and obfuscated function names in the macro code and classifies them as obfuscated or benign function calls. These classifications are later used as feature vectors to train models, to extract the most relevant features from the macro code, and to help classifiers identify the type of malware more accurately (downloader, dropper, shellcode, PowerShell exploit, etc.). Experimental results show that the proposed method is reproducible and can detect completely new macro malware by analyzing the macro code with a Random Forest classifier, reaching 82.65 percent accuracy.
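The abstract describes turning macro code into tokens and then into feature vectors for a classifier. The sketch below illustrates only that general idea and is not the authors' Obfuscated-Word2vec: it swaps the learned embedding for two hand-written heuristics (counts of suspicious calls and mean character entropy of tokens). The names `SUSPICIOUS_CALLS`, `tokenize`, `shannon_entropy`, and `extract_features`, and the keyword list itself, are assumptions made for illustration and do not come from the paper.

```python
import math
import re
from collections import Counter

# Hypothetical list of VBA names often seen in malicious macros.
# This list is an illustrative assumption, not taken from the paper.
SUSPICIOUS_CALLS = {
    "shell", "createobject", "urldownloadtofile",
    "autoopen", "document_open", "chr", "environ",
}

# Matches identifier-like tokens (keywords, function and variable names).
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def tokenize(macro_source):
    """Split VBA source text into lowercase identifier-like tokens."""
    return [token.lower() for token in TOKEN_RE.findall(macro_source)]


def shannon_entropy(text):
    """Character-level Shannon entropy in bits; 0.0 for empty input."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_features(macro_source):
    """Build a small feature dictionary from macro source text.

    A stand-in for embedding-based features: randomly generated
    (obfuscated) names tend to have higher character entropy.
    """
    tokens = tokenize(macro_source)
    counts = Counter(tokens)
    suspicious = sum(counts[name] for name in SUSPICIOUS_CALLS)
    if tokens:
        mean_entropy = sum(shannon_entropy(t) for t in tokens) / len(tokens)
    else:
        mean_entropy = 0.0
    return {
        "token_count": len(tokens),
        "suspicious_call_count": suspicious,
        "mean_token_entropy": mean_entropy,
    }


if __name__ == "__main__":
    sample = 'Sub AutoOpen()\n    Shell Chr(99) & Chr(109) & Chr(100)\nEnd Sub'
    print(extract_features(sample))
```

Feature dictionaries of this kind could be converted to numeric vectors and passed to a classifier such as a random forest. In the approach the abstract describes, the features would instead be derived from Word2Vec embeddings of the macro tokens.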



