Sen Zhao;Shouguo Yang;Zhen Wang;Yongji Liu;Hongsong Zhu;Limin Sun
Crafting Binary Protocol Reversing via Deep Learning With Knowledge-Driven Augmentation
IEEE/ACM Transactions on Networking, vol. 32, no. 6, pp. 5399-5414. Published 2024-10-10. DOI: 10.1109/TNET.2024.3468350
https://ieeexplore.ieee.org/document/10713284/
Citation count: 0
Abstract
Protocol reverse engineering (PRE) is an instrumental tool in security research areas such as protocol fuzzing and intrusion detection. Its primary objective is to uncover the format, semantics, and behavior of an unknown protocol without prior information. This paper presents DL-ProS2, a deep learning-based approach for binary protocol reversing that focuses on format segmentation and semantic inference from network traffic. The approach rests on the observation that multi-scale features within network traffic are effective for identifying various types of fields and semantics. Based on this, DL-ProS2 employs an end-to-end model that integrates U-Net, a siamese network, and BiLSTM-CRF to analyze unknown protocol traffic and extract field boundaries and semantics. To address limited data diversity and coverage, we implement a knowledge-driven traffic simulation technique that uses ChatGPT to extract protocol knowledge from publicly available protocol documents, such as RFCs, as the foundational rules for the simulation. Empirical results show precision exceeding 0.95 and recall exceeding 0.97 for format segmentation and semantic inference on partially unknown protocols. The approach also remains effective on completely unknown protocols, with average precision and recall of 0.69 and 0.62 for format segmentation, and 0.43 and 0.47 for semantic inference, respectively.
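To make the reported precision and recall figures for format segmentation concrete, the following is a minimal sketch of how such metrics could be computed over field boundaries. It assumes that a predicted boundary counts as correct only when it matches a true boundary offset exactly; the abstract does not state the paper's actual evaluation protocol, and the function name, the example message layout, and the offsets are all hypothetical.

```python
def boundary_precision_recall(true_boundaries, predicted_boundaries):
    """Compare two collections of byte offsets at which fields end.

    Assumption (not taken from the paper): a prediction is a hit only
    if it coincides exactly with a true boundary offset.
    """
    true_set = set(true_boundaries)
    pred_set = set(predicted_boundaries)
    hits = len(true_set & pred_set)
    # Precision: fraction of predicted boundaries that are real.
    precision = hits / len(pred_set) if pred_set else 0.0
    # Recall: fraction of real boundaries that were predicted.
    recall = hits / len(true_set) if true_set else 0.0
    return precision, recall


# Hypothetical 12-byte message whose fields end at offsets 2, 4, and 8.
true_b = [2, 4, 8]
# Hypothetical over-segmented prediction with two spurious boundaries.
pred_b = [2, 4, 6, 8, 10]

precision, recall = boundary_precision_recall(true_b, pred_b)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case every true boundary is found, so recall is perfect, while the two spurious boundaries lower precision. An over-segmenting model therefore trades precision for recall, which is why both numbers are reported together.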
About the journal:
The IEEE/ACM Transactions on Networking’s high-level objective is to publish high-quality, original research results derived from theoretical or experimental exploration of the area of communication/computer networking, covering all sorts of information transport networks over all sorts of physical layer technologies, both wireline (all kinds of guided media: e.g., copper, optical) and wireless (e.g., radio-frequency, acoustic (e.g., underwater), infra-red), or hybrids of these. The journal welcomes applied contributions reporting on novel experiences and experiments with actual systems.