Reducing Malware Labeling Efforts Through Efficient Prototype Selection

Guanhong Chen, Shuang Liu
Published in: 2022 26th International Conference on Engineering of Complex Computer Systems (ICECCS)
Publication date: 2022-03-01
DOI: 10.1109/ICECCS54210.2022.00011
URL: https://doi.org/10.1109/ICECCS54210.2022.00011
Citations: 1

Abstract

Malware detection and malware family classification are central to network and system security. The wide adoption of deep learning models has greatly improved performance on both tasks. However, deep-learning-based methods depend heavily on large-scale, high-quality datasets, which require manual labeling. For malware, obtaining such a labeled dataset is extremely difficult because of the domain knowledge required. In this work, we propose to reduce manual labeling effort by selecting a representative subset of instances that has the same distribution as the original full dataset. Our method effectively reduces the labeling workload while keeping the accuracy degradation of the classification model within an acceptable threshold. We compare our method with random sampling on two widely adopted datasets, and the evaluation results show that it achieves significant improvements over this baseline. In particular, with only 20% of the data selected, our method shows only a 2.68% degradation in classification performance compared to the full set, while the baseline shows a 6.78% performance loss. We also compare the effects of factors such as training strategy and model structure on the final results, providing some guidance for subsequent research.
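The abstract does not name the selection algorithm, so the following is only an illustrative sketch of the general idea of picking a 20% subset whose distribution tracks the full dataset. It swaps in kernel herding, a standard distribution-matching technique, which is an assumption and not the authors' method. The 2D synthetic points, the `gamma` value, and the helper names `rbf`, `herding_select`, and `mmd2` are all hypothetical stand-ins for real malware feature vectors and tooling.

```python
import math
import random


def rbf(a, b, gamma=0.5):
    """RBF kernel similarity between two feature vectors."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def herding_select(points, budget, gamma=0.5):
    """Greedily pick `budget` distinct indices via kernel herding."""
    n = len(points)
    # Average similarity of each candidate to the whole dataset.
    mean_sim = [sum(rbf(p, q, gamma) for q in points) / n for p in points]
    sim_to_chosen = [0.0] * n  # running sum of k(x_i, s) over chosen s
    chosen = []
    for t in range(budget):
        taken = set(chosen)
        # Favor points typical of the full set but unlike those already chosen.
        best = max(
            (i for i in range(n) if i not in taken),
            key=lambda i: mean_sim[i] - sim_to_chosen[i] / (t + 1),
        )
        chosen.append(best)
        for i in range(n):
            sim_to_chosen[i] += rbf(points[i], points[best], gamma)
    return chosen


def mmd2(subset, full, gamma=0.5):
    """Squared maximum mean discrepancy (biased estimate) between two samples."""
    def mean_k(xs, ys):
        return sum(rbf(x, y, gamma) for x in xs for y in ys) / (len(xs) * len(ys))
    return mean_k(subset, subset) + mean_k(full, full) - 2 * mean_k(subset, full)


# Synthetic stand-in for malware feature vectors: two unequal clusters.
rng = random.Random(0)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(45)]
points += [(rng.gauss(5, 1), rng.gauss(5, 1)) for _ in range(15)]

budget = len(points) // 5  # label only 20% of the data
herded = herding_select(points, budget)
baseline = rng.sample(range(len(points)), budget)  # random-sampling baseline

print("herding MMD^2:", mmd2([points[i] for i in herded], points))
print("random  MMD^2:", mmd2([points[i] for i in baseline], points))
```

A lower MMD value indicates that the selected subset is closer in distribution to the full set. In a real pipeline, only the selected instances would be sent for manual labeling, and the classifier trained on them would be compared against one trained on the fully labeled data.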