On the Role of Pre-trained Embeddings in Binary Code Analysis

Cited by: 0
Authors
Maier, Alwin [1]
Weissberg, Felix [2]
Rieck, Konrad [2,3]
Affiliations
[1] Max Planck Institute for Solar System Research, Göttingen, Germany
[2] Technische Universität Berlin, Berlin, Germany
[3] BIFOLD, Berlin, Germany
Funding
European Research Council
Keywords
Transfer learning; Binary code analysis
DOI
10.1145/3634737.3657029
CLC classification
TP [Automation and computer technology]
Discipline classification code
0812
Abstract
Deep learning has enabled remarkable progress in binary code analysis. In particular, pre-trained embeddings of assembly code have become a gold standard for solving analysis tasks, such as measuring code similarity or recognizing functions. These embeddings are capable of learning a vector representation from unlabeled code. In contrast to natural language processing, however, label information is not scarce for many tasks in binary code analysis. For example, labeled training data for function boundaries, optimization levels, and argument types can be easily derived from debug information provided by a compiler. Consequently, the main motivation of embeddings does not transfer directly to binary code analysis. In this paper, we explore the role of pre-trained embeddings from a critical perspective. To this end, we systematically evaluate recent embeddings for assembly code on five downstream tasks using a corpus of 1.2 million functions from the Debian distribution. We observe that several embeddings perform similarly when sufficient labeled data is available, and that differences reported in prior work are hardly noticeable. Surprisingly, we find that end-to-end learning without pre-training performs best on average, which calls into question the need for specialized embeddings. By varying the amount of labeled data, we eventually derive guidelines for when embeddings offer advantages and when end-to-end learning is preferable for binary code analysis.
Pages: 795-810
Page count: 16
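
The abstract notes that labels such as function boundaries can be derived from the debug information a compiler emits. As a minimal, hypothetical sketch (not taken from the paper or its artifacts), the following Python snippet uses pyelftools to read DWARF DW_TAG_subprogram entries from an ELF binary built with -g and print function-boundary labels; the example input path is an assumption for illustration.

# Hedged sketch: derive function-boundary labels from DWARF debug information.
# Assumes pyelftools is installed and the binary was compiled with -g.
from elftools.elf.elffile import ELFFile

def function_boundaries(path):
    """Yield (name, start, end) for each function described in the DWARF data."""
    with open(path, "rb") as f:
        elf = ELFFile(f)
        if not elf.has_dwarf_info():
            return
        dwarf = elf.get_dwarf_info()
        for cu in dwarf.iter_CUs():
            for die in cu.iter_DIEs():
                if die.tag != "DW_TAG_subprogram":
                    continue
                low = die.attributes.get("DW_AT_low_pc")
                high = die.attributes.get("DW_AT_high_pc")
                if low is None or high is None:
                    continue  # declarations and inlined instances carry no range
                start = low.value
                # DW_AT_high_pc is either an absolute address (DW_FORM_addr)
                # or an offset from DW_AT_low_pc, depending on the DWARF form.
                end = high.value if high.form == "DW_FORM_addr" else start + high.value
                name = die.attributes.get("DW_AT_name")
                yield (name.value.decode() if name else "<anonymous>", start, end)

if __name__ == "__main__":
    # Hypothetical input binary; replace with any ELF file that carries DWARF info.
    for name, start, end in function_boundaries("/usr/bin/example"):
        print(f"{name}: 0x{start:x}-0x{end:x}")

Labels for the other tasks mentioned in the abstract could be gathered in a similar fashion, for example optimization levels from the DW_AT_producer string (when the compiler records its flags) and argument types from DW_TAG_formal_parameter children of each subprogram.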