Reducing Malware Labeling Efforts Through Efficient Prototype Selection

2022 26th International Conference on Engineering of Complex Computer Systems (ICECCS). Pub Date: 2022-03-01. DOI: 10.1109/ICECCS54210.2022.00011

Guanhong Chen, Shuang Liu

Citations: 1

Abstract

Malware detection and malware family classification are of great importance to network and system security. The wide adoption of deep learning models has greatly improved performance on both tasks. However, deep-learning-based methods rely heavily on large-scale, high-quality datasets, which require manual labeling. Obtaining such a labeled dataset is extremely difficult for malware because of the domain knowledge required. In this work, we propose to reduce the manual labeling effort by selecting a representative subset of instances that has the same distribution as the original full dataset. Our method effectively reduces the labeling workload while keeping the accuracy degradation of the classification model within an acceptable threshold. We compare our method with random sampling on two widely adopted datasets, and the evaluation results show that it achieves significant improvements over this baseline. In particular, with only 20% of the data selected, our method incurs only a 2.68% degradation in classification performance compared to the full set, while the baseline incurs a 6.78% performance loss. We also compare the effects of factors such as training strategy and model structure on the final results, providing some guidance for subsequent research.
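The abstract does not describe the selection algorithm itself, so the following is only a minimal sketch of the general idea of picking a labeling subset whose statistics track the full dataset, set against a random-sampling baseline. It substitutes a simple herding-style greedy rule that matches only the mean of the feature vectors, which is a crude first-moment proxy for "same distribution" and is not claimed to be the authors' method. The function names (`select_prototypes`, `random_baseline`, `mean_gap`) are hypothetical, and the samples are assumed to be already converted into fixed-length numeric feature vectors.

```python
import random


def dataset_mean(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    n, dim = len(points), len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dim)]


def select_prototypes(points, budget):
    """Herding-style greedy selection (an assumed stand-in technique).

    At each step, add the unselected point that brings the mean of the
    selected subset closest to the mean of the full dataset.
    Returns the indices of the selected points, in selection order.
    """
    budget = min(budget, len(points))
    dim = len(points[0])
    full_mean = dataset_mean(points)
    selected, chosen = [], set()
    running_sum = [0.0] * dim
    for step in range(1, budget + 1):
        best_idx, best_gap = None, float("inf")
        for i, p in enumerate(points):
            if i in chosen:
                continue
            # Squared distance between the candidate subset mean and the full mean.
            gap = sum(
                ((running_sum[d] + p[d]) / step - full_mean[d]) ** 2
                for d in range(dim)
            )
            if gap < best_gap:
                best_idx, best_gap = i, gap
        chosen.add(best_idx)
        selected.append(best_idx)
        running_sum = [running_sum[d] + points[best_idx][d] for d in range(dim)]
    return selected


def random_baseline(points, budget, seed=0):
    """Random-sampling baseline: pick `budget` distinct indices uniformly."""
    rng = random.Random(seed)
    return rng.sample(range(len(points)), min(budget, len(points)))


def mean_gap(points, indices):
    """Squared distance between the subset mean and the full-dataset mean."""
    full = dataset_mean(points)
    sub = dataset_mean([points[i] for i in indices])
    return sum((a - b) ** 2 for a, b in zip(full, sub))
```

In use, one would set the budget to the labeling fraction of interest (for example 20% of the samples, as in the abstract), send only the selected indices for manual labeling, and compare a classifier trained on that subset with one trained on the `random_baseline` subset of the same size. Matching higher-order statistics or working per cluster would be natural refinements, but those are design choices this sketch leaves open.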
