Yuntao Zhang, Yanhao Wang, Yuwei Liu, Zhengyuan Pang, B. Fang
{"title":"PDG2Vec: Identify the Binary Function Similarity with Program Dependence Graph","authors":"Yuntao Zhang, Yanhao Wang, Yuwei Liu, Zhengyuan Pang, B. Fang","doi":"10.1109/QRS57517.2022.00061","DOIUrl":null,"url":null,"abstract":"Binary code similarity identification is an important technique applied to many security applications (e.g., plagiarism detection, bug search). The primary challenge of this research topic is how to extract sufficient information from the binary code for similarity comparison. Although numerous approaches have been proposed to address the challenge, most of them leverage features determined by human experience or extracted using machine learning methods and ignore some critical technique semantic information. Additionally, they assess their approach exclusively in laboratory environments and lack real-world datasets. Both problems lead to the limited effectiveness of these methods in real application scenarios (e.g., vulnerable function search).In this paper, we propose a novel approach PDG2Vec, which extracts the data dependence graph and control dependence graph (i.e., program dependence graph (PDG)) as the features of functions and uses them for identifying function similarity. Meanwhile, we design several strategies to optimize the PDG’s construction and use them in similarity comparison to balance time-consuming and accuracy. We implement the prototype of PDG2Vec, which can perform binary code similarity comparison across architectures of x86, x86_64, MIPS32, ARM32, and ARM64. We evaluate PDG2Vec with two datasets. The experimental results show that PDG2Vec is resilient to cross-architecture and extracts more precise semantics than other approaches. Moreover, PDG2Vec outperforms the state-of-the-art tools in the vulnerable function search scenario and has excellent performance.","PeriodicalId":143812,"journal":{"name":"2022 IEEE 22nd International Conference on Software Quality, Reliability and Security (QRS)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE 22nd International Conference on Software Quality, Reliability and Security (QRS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/QRS57517.2022.00061","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Binary code similarity identification is an important technique applied to many security applications (e.g., plagiarism detection, bug search). The primary challenge of this research topic is how to extract sufficient information from the binary code for similarity comparison. Although numerous approaches have been proposed to address the challenge, most of them leverage features determined by human experience or extracted using machine learning methods and ignore some critical technique semantic information. Additionally, they assess their approach exclusively in laboratory environments and lack real-world datasets. Both problems lead to the limited effectiveness of these methods in real application scenarios (e.g., vulnerable function search).In this paper, we propose a novel approach PDG2Vec, which extracts the data dependence graph and control dependence graph (i.e., program dependence graph (PDG)) as the features of functions and uses them for identifying function similarity. Meanwhile, we design several strategies to optimize the PDG’s construction and use them in similarity comparison to balance time-consuming and accuracy. We implement the prototype of PDG2Vec, which can perform binary code similarity comparison across architectures of x86, x86_64, MIPS32, ARM32, and ARM64. We evaluate PDG2Vec with two datasets. The experimental results show that PDG2Vec is resilient to cross-architecture and extracts more precise semantics than other approaches. Moreover, PDG2Vec outperforms the state-of-the-art tools in the vulnerable function search scenario and has excellent performance.