On the Role of Pre-trained Embeddings in Binary Code Analysis

Cited by: 0
Authors:
Maier, Alwin [1]
Weissberg, Felix [2]
Rieck, Konrad [2,3]
Affiliations:
[1] Max Planck Inst Solar Syst Res, Göttingen, Germany
[2] Tech Univ Berlin, Berlin, Germany
[3] BIFOLD, Berlin, Germany
Funding:
European Research Council
Keywords:
Transfer learning; Binary code analysis
DOI:
10.1145/3634737.3657029
CLC number:
TP [Automation Technology, Computer Technology]
Discipline code:
0812
Abstract:
Deep learning has enabled remarkable progress in binary code analysis. In particular, pre-trained embeddings of assembly code have become a gold standard for solving analysis tasks, such as measuring code similarity or recognizing functions. These embeddings learn a vector representation from unlabeled code. In contrast to natural language processing, however, label information is not scarce for many tasks in binary code analysis. For example, labeled training data for function boundaries, optimization levels, and argument types can be easily derived from debug information provided by a compiler. Consequently, the main motivation of embeddings does not transfer directly to binary code analysis. In this paper, we explore the role of pre-trained embeddings from a critical perspective. To this end, we systematically evaluate recent embeddings for assembly code on five downstream tasks using a corpus of 1.2 million functions from the Debian distribution. We observe that several embeddings perform similarly when sufficient labeled data is available, and that differences reported in prior work are hardly noticeable. Surprisingly, we find that end-to-end learning without pre-training performs best on average, which calls into question the need for specialized embeddings. By varying the amount of labeled data, we then derive guidelines for when embeddings offer advantages and when end-to-end learning is preferable for binary code analysis.
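To illustrate the abstract's observation that labels can be derived from compiler debug information, the following minimal sketch (an editor's addition, not the authors' pipeline) reads function-boundary labels from DWARF debug info using the pyelftools library. The input path 'example.bin' is hypothetical; the sketch assumes an ELF binary compiled with -g, as Debian debug packages provide.

    from elftools.elf.elffile import ELFFile

    # Hypothetical input path; any ELF binary compiled with -g will do.
    with open('example.bin', 'rb') as f:
        elf = ELFFile(f)
        if not elf.has_dwarf_info():
            raise SystemExit('no DWARF info; recompile with -g')
        dwarf = elf.get_dwarf_info()
        for cu in dwarf.iter_CUs():
            for die in cu.iter_DIEs():
                # DW_TAG_subprogram DIEs describe functions.
                if die.tag != 'DW_TAG_subprogram':
                    continue
                attrs = die.attributes
                if 'DW_AT_low_pc' not in attrs or 'DW_AT_high_pc' not in attrs:
                    continue  # declaration or inlined instance without a code range
                low = attrs['DW_AT_low_pc'].value
                high = attrs['DW_AT_high_pc']
                # Since DWARF 4, DW_AT_high_pc is typically an offset from
                # low_pc rather than an absolute address; the form tells us which.
                end = high.value if high.form == 'DW_FORM_addr' else low + high.value
                name = attrs.get('DW_AT_name')
                label = name.value.decode() if name else '<anonymous>'
                print(f'{label}: [{hex(low)}, {hex(end)})')

Each recovered (name, start, end) triple can serve directly as a ground-truth label for tasks such as function recognition, which is why labeled data is comparatively cheap in this domain.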
Pages: 795-810
Page count: 16