On the Role of Pre-trained Embeddings in Binary Code Analysis

Cited by: 0
Authors
Maier, Alwin [1 ]
Weissberg, Felix [2 ]
Rieck, Konrad [2 ,3 ]
Affiliations
[1] Max Planck Inst Solar Syst Res, Göttingen, Germany
[2] Tech Univ Berlin, Berlin, Germany
[3] BIFOLD, Berlin, Germany
Funding
European Research Council;
Keywords
Transfer learning; Binary code analysis;
DOI
10.1145/3634737.3657029
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Deep learning has enabled remarkable progress in binary code analysis. In particular, pre-trained embeddings of assembly code have become a gold standard for solving analysis tasks, such as measuring code similarity or recognizing functions. These embeddings are capable of learning a vector representation from unlabeled code. In contrast to natural language processing, however, label information is not scarce for many tasks in binary code analysis. For example, labeled training data for function boundaries, optimization levels, and argument types can be easily derived from debug information provided by a compiler. Consequently, the main motivation of embeddings does not transfer directly to binary code analysis. In this paper, we explore the role of pre-trained embeddings from a critical perspective. To this end, we systematically evaluate recent embeddings for assembly code on five downstream tasks using a corpus of 1.2 million functions from the Debian distribution. We observe that several embeddings perform similarly when sufficient labeled data is available, and that differences reported in prior work are hardly noticeable. Surprisingly, we find that end-to-end learning without pre-training performs best on average, which calls into question the need for specialized embeddings. By varying the amount of labeled data, we eventually derive guidelines for when embeddings offer advantages and when end-to-end learning is preferable for binary code analysis.
Pages: 795-810
Number of pages: 16
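The abstract notes that labeled training data for tasks such as function-boundary recognition can be derived from compiler debug information. Below is a minimal sketch of this idea, not taken from the paper: it uses the pyelftools library to walk the DWARF debug information of an ELF binary and recover function start and end addresses. The binary path is a hypothetical placeholder.

```python
# Minimal sketch: derive function-boundary labels from DWARF debug info.
# Assumes pyelftools is installed; the input path is illustrative only.
from elftools.elf.elffile import ELFFile

def function_boundaries(path):
    """Yield (name, start, end) for each function described in the DWARF info."""
    with open(path, 'rb') as f:
        elf = ELFFile(f)
        if not elf.has_dwarf_info():
            return
        dwarf = elf.get_dwarf_info()
        for cu in dwarf.iter_CUs():
            for die in cu.iter_DIEs():
                if die.tag != 'DW_TAG_subprogram':
                    continue
                low = die.attributes.get('DW_AT_low_pc')
                high = die.attributes.get('DW_AT_high_pc')
                name = die.attributes.get('DW_AT_name')
                if low is None or high is None:
                    continue  # declarations and inlined instances lack a range
                start = low.value
                # DW_AT_high_pc is either an absolute address or an offset
                # from low_pc, depending on its form (DWARF 4+ uses offsets).
                end = high.value if high.form == 'DW_FORM_addr' else start + high.value
                yield (name.value.decode() if name else '<anonymous>', start, end)

# Hypothetical usage on a debug build:
for fname, start, end in function_boundaries('/bin/example'):
    print(f'{fname}: {start:#x}-{end:#x}')
```

Labels obtained this way can supervise a classifier directly, which is the kind of end-to-end setup the abstract contrasts with pre-trained embeddings.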
Related Papers
50 records in total
  • [31] An Empirical Comparison of Pre-Trained Models of Source Code
    Niu, Changan
    Li, Chuanyi
    Ng, Vincent
    Chen, Dongxiao
    Ge, Jidong
    Luo, Bin
    2023 IEEE/ACM 45TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, ICSE, 2023, : 2136 - 2148
  • [32] Leveraging pre-trained language models for code generation
    Soliman, Ahmed
    Shaheen, Samir
    Hadhoud, Mayada
    COMPLEX & INTELLIGENT SYSTEMS, 2024, 10 (03) : 3955 - 3980
  • [33] Diet Code Is Healthy: Simplifying Programs for Pre-trained Models of Code
    Zhang, Zhaowei
    Zhang, Hongyu
    Shen, Beijun
    Gu, Xiaodong
    PROCEEDINGS OF THE 30TH ACM JOINT MEETING EUROPEAN SOFTWARE ENGINEERING CONFERENCE AND SYMPOSIUM ON THE FOUNDATIONS OF SOFTWARE ENGINEERING, ESEC/FSE 2022, 2022, : 1073 - 1084
  • [34] Pre-Trained Semantic Embeddings for POI Categories Based on Multiple Contexts
    Bing, Junxiang
    Chen, Meng
    Yang, Min
    Huang, Weiming
    Gong, Yongshun
    Nie, Liqiang
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2023, 35 (09) : 8893 - 8904
  • [35] Nutrition Guided Recipe Search via Pre-trained Recipe Embeddings
    Li, Diya
    Zaki, Mohammed J.
    Chen, Ching-Hua
    2021 IEEE 37TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING WORKSHOPS (ICDEW 2021), 2021, : 20 - 23
  • [36] Classification of Respiration Sounds Using Deep Pre-trained Audio Embeddings
    Meza, Carlos A. Galindo
    del Hoyo Ontiveros, Juan A.
    Lopez-Meyer, Paulo
    2021 IEEE LATIN AMERICAN CONFERENCE ON COMPUTATIONAL INTELLIGENCE (LA-CCI), 2021.
  • [37] Improving Pre-trained Vision-and-Language Embeddings for Phrase Grounding
    Dou, Zi-Yi
    Peng, Nanyun
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 6362 - 6371
  • [38] Pre-trained Text Embeddings for Enhanced Text-to-Speech Synthesis
    Hayashi, Tomoki
    Watanabe, Shinji
    Toda, Tomoki
    Takeda, Kazuya
    Toshniwal, Shubham
    Livescu, Karen
    INTERSPEECH 2019, 2019, : 4430 - 4434
  • [39] Automated Employee Objective Matching Using Pre-trained Word Embeddings
    Ghanem, Mohab
    Elnaggar, Ahmed
    Mckinnon, Adam
    Debes, Christian
    Boisard, Olivier
    Matthes, Florian
    2021 IEEE 25TH INTERNATIONAL ENTERPRISE DISTRIBUTED OBJECT COMPUTING CONFERENCE (EDOC 2021), 2021, : 51 - 60
  • [40] An Empirical study on Pre-trained Embeddings and Language Models for Bot Detection
    Garcia-Silva, Andres
    Berrio, Cristian
    Gomez-Perez, Jose Manuel
    4TH WORKSHOP ON REPRESENTATION LEARNING FOR NLP (REPL4NLP-2019), 2019, : 148 - 155