Title: Strtune: Data Dependence-Based Code Slicing for Binary Similarity Detection With Fine-Tuned Representation
Authors: Kaiyan He; Yikun Hu; Xuehui Li; Yunhao Song; Yubo Zhao; Dawu Gu
Journal: IEEE Transactions on Information Forensics and Security, vol. 19, pp. 10233-10245
Published: 2024-11-12
DOI: 10.1109/TIFS.2024.3484944
URL: https://ieeexplore.ieee.org/document/10750885/
Impact factor: 6.3 (JCR Q1, Computer Science, Theory & Methods)
Citations: 0
Abstract
Binary Code Similarity Detection (BCSD) is important for software security because, by comparing code patterns, it supports binary analysis tasks such as identifying malicious code snippets and analyzing binary patches. Recently, artificial intelligence-based approaches to BCSD have attracted growing attention due to their scalability and generalization. Because binaries are compiled with different compilation configurations, existing approaches still face notable limitations when comparing binaries for similarity. First, BCSD requires analyzing code behavior; existing work claims to extract semantics but in practice still analyzes syntax. Second, because it extracts features directly from assembly sequences, existing work cannot handle the instruction reordering and differing syntactic expressions caused by various compilation configurations. In this paper, we propose STRTUNE, which slices binary code based on data dependence and performs slice-level fine-tuning. To address the first limitation, STRTUNE performs backward slicing based on data dependence to capture how a value is computed along the execution. Each slice reflects the collecting semantics of the code, which is stable across different compilation configurations. STRTUNE introduces flow types to emphasize the independence of computations between slices, forming a graph representation. To overcome the second limitation, STRTUNE uses a Siamese network to fine-tune on pairs of slices that correspond to the same value computation but differ in syntactic representation, bringing their representations closer in the feature space. This allows the cross-graph attention to focus more on matching similar slices based on slice contents and the flow types involved. Our evaluation results demonstrate the effectiveness and practicality of STRTUNE. We show that STRTUNE outperforms state-of-the-art BCSD methods, achieving a Recall@1 that is 25.3% and 22.2% higher than jTrans and GMN, respectively, in the task of cross-optimization function retrieval on x64.
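The idea of backward slicing on data dependence can be illustrated with a minimal sketch. This is not STRTUNE's implementation: the `Insn` record, the `insn` helper, the `backward_slice` function, and the hand-written def/use sets are all hypothetical names and data chosen for illustration. The sketch also assumes straight-line code and ignores control flow, memory aliasing, and status flags. It shows why a slice can stay stable when a compiler reorders independent instructions.

```python
from typing import NamedTuple


class Insn(NamedTuple):
    """Toy instruction record (hypothetical, for illustration only)."""
    text: str         # assembly-like text, used only for display
    defs: frozenset   # registers written by the instruction
    uses: frozenset   # registers read by the instruction


def insn(text, defs, uses):
    """Build an Insn from plain lists of register names."""
    return Insn(text, frozenset(defs), frozenset(uses))


def backward_slice(insns, criterion):
    """Indices of instructions the criterion depends on via data flow.

    Walks backward from the criterion, keeping every instruction that
    defines a register still needed, then adding that instruction's own
    inputs to the needed set. Straight-line code only.
    """
    needed = set(insns[criterion].uses)
    kept = [criterion]
    for i in range(criterion - 1, -1, -1):
        cur = insns[i]
        if cur.defs & needed:
            kept.append(i)
            needed -= cur.defs   # these definitions are now satisfied
            needed |= cur.uses   # but their inputs must be traced further
    return sorted(kept)


# Two orderings of the same computation, as two compilation
# configurations might plausibly emit them (assumed example data).
prog_a = [
    insn("mov rax, [rdi]", ["rax"], ["rdi"]),
    insn("mov rcx, 5", ["rcx"], []),
    insn("add rax, rsi", ["rax"], ["rax", "rsi"]),
    insn("imul rcx, rdx", ["rcx"], ["rcx", "rdx"]),
    insn("mov [rbx], rax", [], ["rbx", "rax"]),
]
prog_b = [prog_a[1], prog_a[0], prog_a[3], prog_a[2], prog_a[4]]

# Slice on the final store in each ordering.
slice_a = backward_slice(prog_a, 4)
slice_b = backward_slice(prog_b, 4)
texts_a = [prog_a[i].text for i in slice_a]
texts_b = [prog_b[i].text for i in slice_b]
print(texts_a)
print(texts_b)
```

In both orderings the slice on the final store contains the same three instructions in the same relative order, while the unrelated `rcx` computation is excluded. A real binary would require a proper data-flow analysis over a control-flow graph rather than hand-written def/use sets.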
About the journal:
The IEEE Transactions on Information Forensics and Security covers the sciences, technologies, and applications relating to information forensics, information security, biometrics, surveillance, and systems applications that incorporate these features.