Yunan Zhang, Chenghao Rong, Qingjia Huang, Yang Wu, Zeming Yang, Jianguo Jiang
Based on Multi-features and Clustering Ensemble Method for Automatic Malware Categorization
2017 IEEE Trustcom/BigDataSE/ICESS, published 2017-08-01
DOI: 10.1109/Trustcom/BigDataSE/ICESS.2017.222
Citations: 18
Abstract
Automatic malware categorization plays an important role in combating the large volume of current malware and in supporting the corresponding forensics. A great deal of sample information can be extracted with static analysis tools and dynamic sandboxes, and combining these features effectively for further analysis gives a better understanding of the samples. However, most current work on malware analysis relies on a single category of machine learning algorithm to categorize samples, even though different clustering algorithms have their own strengths and weaknesses. How to combine the merits of multiple categories of features and algorithms to further improve the analysis result is therefore a critical question. In this paper, we propose a novel, scalable malware analysis framework that exploits the complementary nature of different features and algorithms to optimally integrate their results. Using the concept of clustering ensemble, our system combines the partitions produced by each individual category of feature and algorithm to obtain better quality and robustness. Our system consists of three parts: (1) extracting multiple categories of static and dynamic features; (2) using the k-means and hierarchical clustering algorithms to construct the base clusterings; and (3) applying an efficient method based on a mixture-model clustering ensemble to conduct the final clustering analysis. We evaluated our method on two malware datasets, the Microsoft malware dataset and our own malware dataset, which contain 10868 and 53760 samples respectively. Our experimental results show that our method can categorize malware with better quality and robustness. Our method also has certain advantages in system run time and memory consumption compared with state-of-the-art malware analysis approaches.
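
The abstract does not spell out the ensemble step, so the following is only a minimal sketch of one common way to build a mixture-model clustering ensemble, not the authors' implementation. It assumes each sample is described by the vector of cluster labels it received from the base clusterings (for example, k-means and hierarchical runs over different feature categories), models those vectors as a mixture of multinomial distributions, and fits the mixture with EM. The function name `mixture_consensus`, the smoothing constants, and the choice to initialise from the first base partition (made here so the sketch is deterministic) are all illustrative assumptions.

```python
import math


def mixture_consensus(label_matrix, k, n_iter=30):
    """Consensus clustering via EM on a mixture of multinomials (sketch).

    label_matrix[i][h] is the integer label (0-based) that base
    partition h assigned to sample i. Returns one consensus label
    per sample, in range(k).
    """
    n = len(label_matrix)
    n_parts = len(label_matrix[0])
    # Number of distinct labels used by each base partition.
    sizes = [max(row[h] for row in label_matrix) + 1 for h in range(n_parts)]

    # Assumption: soft initialisation from the first base partition.
    resp = []
    for row in label_matrix:
        raw = [(1.0 if m == row[0] % k else 0.0) + 0.05 for m in range(k)]
        total = sum(raw)
        resp.append([v / total for v in raw])

    for _ in range(n_iter):
        # M-step: mixing weights and per-partition label probabilities.
        weights = [sum(r[m] for r in resp) / n for m in range(k)]
        theta = [[[0.01] * sizes[h] for h in range(n_parts)] for _ in range(k)]
        for i, row in enumerate(label_matrix):
            for m in range(k):
                for h, lab in enumerate(row):
                    theta[m][h][lab] += resp[i][m]
        for m in range(k):
            for h in range(n_parts):
                total = sum(theta[m][h])
                theta[m][h] = [v / total for v in theta[m][h]]

        # E-step: posterior of each component given the label vector.
        for i, row in enumerate(label_matrix):
            logp = [
                math.log(weights[m] + 1e-12)
                + sum(math.log(theta[m][h][lab]) for h, lab in enumerate(row))
                for m in range(k)
            ]
            top = max(logp)
            p = [math.exp(v - top) for v in logp]
            total = sum(p)
            resp[i] = [v / total for v in p]

    # Hard assignment: most probable component for each sample.
    return [max(range(k), key=lambda m: resp[i][m]) for i in range(n)]


# Hypothetical toy input: 6 samples, 3 base partitions. The third
# partition disagrees with the other two about sample index 2.
base_labels = [
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
    [1, 0, 1],
    [1, 0, 1],
]
print(mixture_consensus(base_labels, k=2))  # [0, 0, 0, 1, 1, 1]
```

In this toy input the consensus keeps sample 2 with the first group even though one base partition places it elsewhere, which illustrates how an ensemble can absorb the weakness of a single clustering. A practical version would typically use several random restarts rather than a fixed initialisation.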