Cross-Language Binary-Source Code Matching with Intermediate Representations

Cited by: 10
Authors
Gui, Yi [1]
Wan, Yao [1]
Zhang, Hongyu [2]
Huang, Huifang [3]
Sui, Yulei [4]
Xu, Guandong [4]
Shao, Zhiyuan [1]
Jin, Hai [1]
Affiliations
[1] Huazhong Univ Sci & Technol, Natl Engn Res Ctr Big Data Technol & Syst, Sch Comp Sci & Technol,Cluster & Grid Comp Lab, Serv Comp Technol & Syst Lab, Wuhan, Peoples R China
[2] Univ Newcastle, Newcastle, NSW, Australia
[3] Huazhong Univ Sci & Technol, Sch Math & Stat, Wuhan, Peoples R China
[4] Univ Technol Sydney, Sch Comp Sci, Sydney, NSW, Australia
Funding
National Natural Science Foundation of China
Keywords
Cross-language; clone detection; intermediate representation; binary code; code matching; deep learning
DOI
10.1109/SANER53432.2022.00077
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification codes
081202; 0835
Abstract
Binary-source code matching plays an important role in many security and software engineering tasks, such as malware detection, reverse engineering, and vulnerability assessment. Several approaches have been proposed for binary-source code matching that jointly learn the embeddings of binary code and source code in a common vector space. Despite much effort, existing approaches only target matching binary code and source code written in a single programming language. In practice, however, software applications are often written in different programming languages to cater to different requirements and computing platforms. Matching binary and source code across programming languages introduces additional challenges when maintaining multi-language and multi-platform applications. To this end, this paper formulates the problem of cross-language binary-source code matching and develops a new dataset for it. We present a novel approach, XLIR, a Transformer-based neural network that learns intermediate representations for both binary and source code. To validate the effectiveness of XLIR, comprehensive experiments are conducted on two tasks, cross-language binary-source code matching and cross-language source-source code matching, on our curated dataset. Experimental results and analysis show that XLIR with intermediate representations significantly outperforms other state-of-the-art models on both tasks.
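The idea of matching in a common vector space can be illustrated with a minimal sketch. It is not XLIR's implementation: the embeddings below are hand-written, hypothetical vectors standing in for the output of a learned encoder, and the file names and the helper `rank_candidates` are assumptions made for illustration. The sketch only shows the retrieval step, where a binary-side embedding is compared against source-side embeddings by cosine similarity and the candidates are ranked.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(query_vec, candidates):
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings: in a real system an encoder would produce these
# from the binary and from each source file; here they are made up.
binary_query = [0.9, 0.1, 0.0]
source_candidates = {
    "sort.java": [0.8, 0.2, 0.1],
    "hash.c": [0.0, 0.9, 0.4],
    "parse.cpp": [0.1, 0.1, 0.9],
}

ranking = rank_candidates(binary_query, source_candidates)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With these made-up vectors the Java file ranks first even though the other candidates are in C and C++, which is the cross-language behaviour a shared embedding space is meant to allow.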
Pages: 601-612
Number of pages: 12