Semi-supervised Classification of Malware Families Under Extreme Class Imbalance via Hierarchical Non-Negative Matrix Factorization with Automatic Model Selection

IF 2.8 4区计算机科学 Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS ACM Transactions on Privacy and Security Pub Date : 2023-11-13 DOI:10.1145/3624567

Maksim E. Eren, Manish Bhattarai, Robert J. Joyce, Edward Raff, Charles Nicholas, Boian S. Alexandrov

{"title":"Semi-supervised Classification of Malware Families Under Extreme Class Imbalance via Hierarchical Non-Negative Matrix Factorization with Automatic Model Selection","authors":"Maksim E. Eren, Manish Bhattarai, Robert J. Joyce, Edward Raff, Charles Nicholas, Boian S. Alexandrov","doi":"10.1145/3624567","DOIUrl":null,"url":null,"abstract":"Identification of the family to which a malware specimen belongs is essential in understanding the behavior of the malware and developing mitigation strategies. Solutions proposed by prior work, however, are often not practicable due to the lack of realistic evaluation factors. These factors include learning under class imbalance, the ability to identify new malware, and the cost of production-quality labeled data. In practice, deployed models face prominent, rare, and new malware families. At the same time, obtaining a large quantity of up-to-date labeled malware for training a model can be expensive. In this article, we address these problems and propose a novel hierarchical semi-supervised algorithm, which we call the HNMFk Classifier , that can be used in the early stages of the malware family labeling process. Our method is based on non-negative matrix factorization with automatic model selection, that is, with an estimation of the number of clusters. With HNMFk Classifier , we exploit the hierarchical structure of the malware data together with a semi-supervised setup, which enables us to classify malware families under conditions of extreme class imbalance. Our solution can perform abstaining predictions, or rejection option, which yields promising results in the identification of novel malware families and helps with maintaining the performance of the model when a low quantity of labeled data is used. We perform bulk classification of nearly 2,900 both rare and prominent malware families, through static analysis, using nearly 388,000 samples from the EMBER-2018 corpus. In our experiments, we surpass both supervised and semi-supervised baseline models with an F1 score of 0.80.","PeriodicalId":56050,"journal":{"name":"ACM Transactions on Privacy and Security","volume":"4 3","pages":"0"},"PeriodicalIF":2.8000,"publicationDate":"2023-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Privacy and Security","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3624567","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 1

Abstract

Identification of the family to which a malware specimen belongs is essential in understanding the behavior of the malware and developing mitigation strategies. Solutions proposed by prior work, however, are often not practicable due to the lack of realistic evaluation factors. These factors include learning under class imbalance, the ability to identify new malware, and the cost of production-quality labeled data. In practice, deployed models face prominent, rare, and new malware families. At the same time, obtaining a large quantity of up-to-date labeled malware for training a model can be expensive. In this article, we address these problems and propose a novel hierarchical semi-supervised algorithm, which we call the HNMFk Classifier , that can be used in the early stages of the malware family labeling process. Our method is based on non-negative matrix factorization with automatic model selection, that is, with an estimation of the number of clusters. With HNMFk Classifier , we exploit the hierarchical structure of the malware data together with a semi-supervised setup, which enables us to classify malware families under conditions of extreme class imbalance. Our solution can perform abstaining predictions, or rejection option, which yields promising results in the identification of novel malware families and helps with maintaining the performance of the model when a low quantity of labeled data is used. We perform bulk classification of nearly 2,900 both rare and prominent malware families, through static analysis, using nearly 388,000 samples from the EMBER-2018 corpus. In our experiments, we surpass both supervised and semi-supervised baseline models with an F1 score of 0.80.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

基于层次非负矩阵分解和自动模型选择的极端类不平衡下的半监督恶意软件分类

识别恶意软件样本所属的家族对于理解恶意软件的行为和制定缓解策略至关重要。然而，由于缺乏现实的评价因素，以往工作提出的解决方案往往不可行。这些因素包括在班级不平衡的情况下学习，识别新恶意软件的能力，以及生产质量标记数据的成本。实际上，部署的模型面临着突出的、罕见的和新的恶意软件家族。与此同时，获取大量最新标记的恶意软件来训练模型可能是昂贵的。在本文中，我们解决了这些问题，并提出了一种新的分层半监督算法，我们称之为HNMFk分类器，可用于恶意软件家族标记过程的早期阶段。我们的方法是基于自动模型选择的非负矩阵分解，即对聚类数量进行估计。利用HNMFk分类器，我们利用恶意软件数据的层次结构和半监督设置，使我们能够在极端类别不平衡的情况下对恶意软件家族进行分类。我们的解决方案可以执行弃权预测或拒绝选项，这在识别新的恶意软件家族方面产生了有希望的结果，并有助于在使用少量标记数据时保持模型的性能。通过静态分析，我们使用来自2018年12月语料库的近38.8万个样本，对近2900个罕见和突出的恶意软件家族进行了批量分类。在我们的实验中，我们以0.80的F1分数超越了监督和半监督基线模型。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

ACM Transactions on Privacy and Security Computer Science-General Computer Science

CiteScore

5.20

自引率

0.00%

发文量

期刊介绍： ACM Transactions on Privacy and Security (TOPS) (formerly known as TISSEC) publishes high-quality research results in the fields of information and system security and privacy. Studies addressing all aspects of these fields are welcomed, ranging from technologies, to systems and applications, to the crafting of policies.

期刊最新文献

ZPredict: ML-Based IPID Side-channel Measurements ZTA-IoT: A Novel Architecture for Zero-Trust in IoT Systems and an Ensuing Usage Control Model Security Analysis of the Consumer Remote SIM Provisioning Protocol X-squatter: AI Multilingual Generation of Cross-Language Sound-squatting Toward Robust ASR System against Audio Adversarial Examples using Agitated Logit