Blind protocol identification using synthetic dataset: A case study on geographic protocols

Forensic Science International-Digital Investigation (IF 2.2; Zone 4, Medicine; JCR Q3, Computer Science, Information Systems). Pub Date: 2025-03-13. DOI: 10.1016/j.fsidi.2025.301911

Mohammad Abbasi-Azar , Mehdi Teimouri , Mohsen Nikray

Volume 53, Article 301911. Full text: https://www.sciencedirect.com/science/article/pii/S2666281725000502

Abstract

Network forensics faces major challenges, including increasingly sophisticated cyberattacks and the difficulty of obtaining labeled datasets for training AI-driven security tools. Blind Protocol Identification (BPI), essential for detecting covert data transfers, is particularly impacted by these data limitations. This paper introduces a novel and inherently scalable method for generating synthetic datasets tailored for BPI in network forensics. Our approach emphasizes feature engineering and a statistical-analytical model of feature distributions to address the scarcity and imbalance of labeled data. We demonstrate the effectiveness of this method through a case study on geographic protocols, where we train Random Forest models using only synthetic datasets and evaluate their performance on real-world traffic. This work presents a promising solution to the data challenges in BPI, enabling reliable protocol identification while maintaining data privacy and overcoming traditional data collection limitations.
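The workflow the abstract describes (model the feature distributions, sample a synthetic dataset, train on it alone, then evaluate on separate traffic) can be illustrated with a minimal, standard-library-only sketch. Everything specific here is an assumption made for illustration and not taken from the paper: the two toy message models, the two features (byte entropy and printable-ASCII fraction), and all function names. A nearest-centroid classifier stands in for the Random Forest the authors use, and the held-out set is itself generated, whereas the paper evaluates on real-world captures.

```python
import math
import random
from collections import Counter

# Hypothetical message models (not from the paper).
def gen_text_like(rng, n=64):
    """Synthetic ASCII sentence: '$' followed by printable payload characters."""
    body = "".join(rng.choice("0123456789,.ABCDEFGNPSW") for _ in range(n - 1))
    return ("$" + body).encode("ascii")

def gen_binary_like(rng, n=64):
    """Synthetic binary frame: arbitrary 2-byte sync word plus uniform random bytes."""
    return bytes([0xB5, 0x62]) + bytes(rng.randrange(256) for _ in range(n - 2))

def features(msg):
    """Two engineered features: byte entropy in bits, fraction of printable ASCII."""
    counts = Counter(msg)
    total = len(msg)
    entropy = -sum(c / total * math.log2(c / total) for c in counts.values())
    printable = sum(32 <= b < 127 for b in msg) / total
    return (entropy, printable)

def make_dataset(rng, per_class):
    """Sample per_class labeled feature vectors from each message model."""
    data = []
    for label, gen in (("text_like", gen_text_like), ("binary_like", gen_binary_like)):
        data += [(features(gen(rng)), label) for _ in range(per_class)]
    return data

def fit_centroids(data):
    """Nearest-centroid 'training': mean feature vector per label.
    This is a stand-in for the Random Forest named in the abstract."""
    sums, counts = {}, Counter()
    for x, label in data:
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] += 1
    return {label: tuple(v / counts[label] for v in acc) for label, acc in sums.items()}

def predict(centroids, x):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], x))

# Train on synthetic data only.
train = make_dataset(random.Random(0), 200)
centroids = fit_centroids(train)

# Evaluate on an independently sampled set; real captures would replace this.
held_out = make_dataset(random.Random(1), 50)
accuracy = sum(predict(centroids, x) == y for x, y in held_out) / len(held_out)
print(f"held-out accuracy: {accuracy:.2f}")
```

The point of the sketch is the separation of concerns: the generators encode what is assumed about each protocol's structure, so labeled training data can be produced in any quantity and class balance without collecting or exposing real traffic.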
