Blind protocol identification using synthetic dataset: A case study on geographic protocols
Mohammad Abbasi-Azar, Mehdi Teimouri, Mohsen Nikray
Forensic Science International-Digital Investigation, vol. 53, Article 301911. Published 2025-03-13. DOI: 10.1016/j.fsidi.2025.301911
https://www.sciencedirect.com/science/article/pii/S2666281725000502
Citations: 0
Abstract
Network forensics faces major challenges, including increasingly sophisticated cyberattacks and the difficulty of obtaining labeled datasets for training AI-driven security tools. Blind Protocol Identification (BPI), essential for detecting covert data transfers, is particularly impacted by these data limitations. This paper introduces a novel and inherently scalable method for generating synthetic datasets tailored for BPI in network forensics. Our approach emphasizes feature engineering and a statistical-analytical model of feature distributions to address the scarcity and imbalance of labeled data. We demonstrate the effectiveness of this method through a case study on geographic protocols, where we train Random Forest models using only synthetic datasets and evaluate their performance on real-world traffic. This work presents a promising solution to the data challenges in BPI, enabling reliable protocol identification while maintaining data privacy and overcoming traditional data collection limitations.
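The workflow described in the abstract (model per-protocol feature distributions, sample a synthetic training set from them, train a Random Forest on the synthetic data only, then test on separate traffic) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' method: the protocol labels `proto_a` and `proto_b`, the two feature names, the Gaussian form of the feature model, and every numeric parameter are hypothetical, and the "real" test set is only a stand-in drawn from slightly shifted distributions because no real capture is available here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Hypothetical per-protocol feature models: (mean, std) for each feature.
# Labels, feature names, and numbers are illustrative assumptions,
# not values taken from the paper.
FEATURES = ["byte_entropy", "mean_byte"]
FEATURE_MODELS = {
    "proto_a": {"byte_entropy": (4.0, 0.3), "mean_byte": (60.0, 5.0)},
    "proto_b": {"byte_entropy": (6.5, 0.3), "mean_byte": (120.0, 5.0)},
}


def sample_dataset(models, n_per_class, rng, shift=0.0):
    """Draw a class-balanced labeled dataset from per-class Gaussian models.

    `shift` moves every feature mean by `shift` standard deviations, which
    is used below to imitate a mismatch between synthetic and real data.
    """
    X_parts, y = [], []
    for label, feats in models.items():
        cols = [
            rng.normal(feats[f][0] + shift * feats[f][1], feats[f][1], n_per_class)
            for f in FEATURES
        ]
        X_parts.append(np.column_stack(cols))
        y.extend([label] * n_per_class)
    return np.vstack(X_parts), np.array(y)


def train_and_evaluate(seed=0):
    """Train on synthetic samples only, then score on the stand-in test set."""
    rng = np.random.default_rng(seed)
    # Synthetic training set: any size, balanced by construction.
    X_syn, y_syn = sample_dataset(FEATURE_MODELS, 500, rng)
    # Stand-in for real traffic (assumption): same models, means shifted
    # by half a standard deviation.
    X_real, y_real = sample_dataset(FEATURE_MODELS, 200, rng, shift=0.5)
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    clf.fit(X_syn, y_syn)
    return accuracy_score(y_real, clf.predict(X_real))


if __name__ == "__main__":
    print(f"accuracy on stand-in test set: {train_and_evaluate():.3f}")
```

Because the training set is sampled rather than collected, its size and class balance are free parameters, which is the property the abstract points to when it mentions scarcity and imbalance of labeled data. In practice the stand-in test set would be replaced by features extracted from labeled real captures.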