Cross-Language Binary-Source Code Matching with Intermediate Representations

Cited by: 10
Authors
Gui, Yi [1]
Wan, Yao [1]
Zhang, Hongyu [2]
Huang, Huifang [3]
Sui, Yulei [4]
Xu, Guandong [4]
Shao, Zhiyuan [1]
Jin, Hai [1]
Affiliations
[1] Huazhong Univ Sci & Technol, Natl Engn Res Ctr Big Data Technol & Syst, Sch Comp Sci & Technol,Cluster & Grid Comp Lab, Serv Comp Technol & Syst Lab, Wuhan, Peoples R China
[2] Univ Newcastle, Newcastle, NSW, Australia
[3] Huazhong Univ Sci & Technol, Sch Math & Stat, Wuhan, Peoples R China
[4] Univ Technol Sydney, Sch Comp Sci, Sydney, NSW, Australia
Funding
National Natural Science Foundation of China;
Keywords
Cross-language; clone detection; intermediate representation; binary code; code matching; deep learning;
DOI
10.1109/SANER53432.2022.00077
Chinese Library Classification
TP31 [Computer Software];
Discipline classification codes
081202 ; 0835 ;
Abstract
Binary-source code matching plays an important role in many security and software engineering tasks, such as malware detection, reverse engineering, and vulnerability assessment. Several approaches have been proposed for binary-source code matching that jointly learn the embeddings of binary code and source code in a common vector space. Despite much effort, existing approaches target only the matching of binary code and source code written in a single programming language. In practice, however, software applications are often written in different programming languages to cater for different requirements and computing platforms. Matching binary and source code across programming languages introduces additional challenges when maintaining multi-language and multi-platform applications. To this end, this paper formulates the problem of cross-language binary-source code matching and develops a new dataset for this problem. We present a novel approach, XLIR, a Transformer-based neural network that learns intermediate representations for both binary and source code. To validate the effectiveness of XLIR, comprehensive experiments are conducted on two tasks, cross-language binary-source code matching and cross-language source-source code matching, on top of our curated dataset. Experimental results and analysis show that XLIR with intermediate representations significantly outperforms other state-of-the-art models on both tasks.
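As a rough illustration of what matching in a common vector space means, the sketch below ranks candidate source files against one binary by cosine similarity. It is not the XLIR implementation: the vectors are hand-made placeholders standing in for embeddings that a trained encoder would produce, and the file names and the helper names `cosine` and `rank_candidates` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(query, candidates):
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scored = [(name, cosine(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Assumption: placeholder embeddings; a real system would obtain these
# from a learned encoder applied to binary and source code.
binary_embedding = [0.9, 0.1, 0.3]
source_embeddings = {
    "sort.java": [0.8, 0.2, 0.4],
    "hash.c": [0.1, 0.9, 0.2],
    "parse.cpp": [0.3, 0.3, 0.9],
}

ranking = rank_candidates(binary_embedding, source_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because both kinds of code are embedded in the same space, the source language of each candidate does not affect how the comparison is computed; only the quality of the learned embeddings does.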
Pages: 601-612
Page count: 12