FCG-MFD: Benchmark function call graph-based dataset for malware family detection

Journal of Network and Computer Applications (IF 7.7; Zone 2, Computer Science; JCR Q1, Computer Science, Hardware & Architecture). Pub Date: 2024-11-07. DOI: 10.1016/j.jnca.2024.104050
Hassan Jalil Hadi, Yue Cao, Sifan Li, Naveed Ahmad, Mohammed Ali Alshara
Volume 233, Article 104050. Full text: https://www.sciencedirect.com/science/article/pii/S1084804524002273
Citations: 0

Abstract

Cybercrime involving malware families is on the rise, and this growth persists despite the prevalence of antivirus software and of approaches for malware detection and classification. Security experts have applied Machine Learning (ML) techniques to identify these threats. However, such approaches need up-to-date malware datasets to keep improving as malware strains grow more sophisticated. We therefore present FCG-MFD, a benchmark dataset of Function Call Graphs (FCGs) for malware family detection. The dataset is intended to help security systems stay resilient against emerging malware families. It comprises two sub-datasets (FCG and Metadata; 100,000 samples) drawn from VirusSamples, Virusshare, VirusSign, theZoo, Vx-underground, and MalwareBazaar, and curated using FCGs and metadata to improve the efficacy of ML algorithms. We also propose a malware analysis technique based on FCGs and graph embedding networks, which addresses the complexity of feature engineering in ML-based malware analysis. To extract semantic features, we borrow from Natural Language Processing (NLP): functions are treated as sentences and instructions as words. We then use a node2vec-based graph embedding network to generate malware embedding vectors. By combining structural and semantic features, these vectors enable automated and efficient malware analysis. We evaluate FCG-MFD on both sub-datasets (FCG and Metadata); the resulting F1-scores of 99.14% and 99.28% are competitive with state-of-the-art (SOTA) methods.
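The abstract does not describe the embedding network in detail, so the following is only a minimal sketch of the walk-sampling idea behind node2vec, applied to a toy call graph. The graph, the function names, and the parameter values are all hypothetical and are not taken from the paper. In node2vec, the sampled walks would normally be fed to a skip-gram model to learn node vectors; that step is omitted here.

```python
import random
from typing import Dict, List

# Hypothetical toy call graph: caller -> callees (illustrative names only).
CALL_GRAPH: Dict[str, List[str]] = {
    "main": ["parse_args", "connect"],
    "parse_args": [],
    "connect": ["resolve_host", "send_data"],
    "resolve_host": [],
    "send_data": ["encrypt"],
    "encrypt": [],
}


def undirected(graph: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return a symmetric adjacency map so walks can move in both directions."""
    adj = {node: set() for node in graph}
    for caller, callees in graph.items():
        for callee in callees:
            adj.setdefault(callee, set())
            adj[caller].add(callee)
            adj[callee].add(caller)
    return {node: sorted(neigh) for node, neigh in adj.items()}


def biased_walk(adj, start, length, p, q, rng):
    """Sample one node2vec-style second-order random walk.

    p is the return parameter and q is the in-out parameter.
    """
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = adj[cur]
        if not neighbors:
            break  # isolated node: nothing to walk to
        if len(walk) == 1:
            # First step has no previous node, so sample uniformly.
            walk.append(rng.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:
                weights.append(1.0 / p)  # step back to the previous node
            elif nxt in adj[prev]:
                weights.append(1.0)      # stay at distance 1 from prev
            else:
                weights.append(1.0 / q)  # move further away from prev
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


def generate_walks(graph, walks_per_node=5, length=6, p=1.0, q=0.5, seed=0):
    """Sample several walks starting from every node of the graph."""
    rng = random.Random(seed)
    adj = undirected(graph)
    walks = []
    for _ in range(walks_per_node):
        for node in adj:
            walks.append(biased_walk(adj, node, length, p, q, rng))
    return walks


if __name__ == "__main__":
    for walk in generate_walks(CALL_GRAPH)[:3]:
        print(" -> ".join(walk))
```

Each walk plays the role of a "sentence" of function names, which is what lets a word-embedding model capture the structure of the call graph. A value of q below 1 biases walks outward (closer to depth-first exploration), while a small p makes walks return to the previous node more often.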
Source journal: Journal of Network and Computer Applications (Engineering & Technology, Computer Science: Interdisciplinary Applications)
CiteScore: 21.50
Self-citation rate: 3.40%
Articles published: 142
Review time: 37 days
About the journal: The Journal of Network and Computer Applications welcomes research contributions, surveys, and notes in all areas relating to computer networks and their applications. Sample topics include new design techniques; interesting or novel applications, components, or standards; computer networks with tools such as WWW; emerging standards for internet protocols; wireless networks; mobile computing; emerging computing models such as cloud computing and grid computing; and applications of networked systems for remote collaboration and telemedicine. The journal is abstracted and indexed in Scopus, Engineering Index, Web of Science, Science Citation Index Expanded, and INSPEC.