On the Role of Pre-trained Embeddings in Binary Code Analysis

Cited by: 0
Authors
Maier, Alwin [1 ]
Weissberg, Felix [2 ]
Rieck, Konrad [2 ,3 ]
Affiliations
[1] Max Planck Inst Solar Syst Res, Göttingen, Germany
[2] Tech Univ Berlin, Berlin, Germany
[3] BIFOLD, Berlin, Germany
Funding
European Research Council;
Keywords
Transfer learning; Binary code analysis;
DOI
10.1145/3634737.3657029
Chinese Library Classification
TP [Automation technology, computer technology];
Subject Classification Code
0812;
Abstract
Deep learning has enabled remarkable progress in binary code analysis. In particular, pre-trained embeddings of assembly code have become a gold standard for solving analysis tasks, such as measuring code similarity or recognizing functions. These embeddings are capable of learning a vector representation from unlabeled code. In contrast to natural language processing, however, label information is not scarce for many tasks in binary code analysis. For example, labeled training data for function boundaries, optimization levels, and argument types can be easily derived from debug information provided by a compiler. Consequently, the main motivation of embeddings does not transfer directly to binary code analysis. In this paper, we explore the role of pre-trained embeddings from a critical perspective. To this end, we systematically evaluate recent embeddings for assembly code on five downstream tasks using a corpus of 1.2 million functions from the Debian distribution. We observe that several embeddings perform similarly when sufficient labeled data is available, and that differences reported in prior work are hardly noticeable. Surprisingly, we find that end-to-end learning without pre-training performs best on average, which calls into question the need for specialized embeddings. By varying the amount of labeled data, we eventually derive guidelines for when embeddings offer advantages and when end-to-end learning is preferable for binary code analysis.
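The abstract notes that labels such as function boundaries, optimization levels, and argument types can be derived from compiler-emitted debug information. The following is a minimal illustrative sketch, not taken from the paper, of how function-boundary labels might be read from DWARF data using pyelftools; the binary path "example_binary" is a placeholder and the binary is assumed to have been compiled with -g.

```python
# Illustrative sketch: derive function-boundary labels from DWARF debug info.
# Assumes pyelftools is installed and "example_binary" (hypothetical path)
# contains DW_TAG_subprogram entries emitted by the compiler.
from elftools.elf.elffile import ELFFile

def function_boundaries(path):
    """Yield (name, start, end) for each function described in the DWARF data."""
    with open(path, "rb") as f:
        elf = ELFFile(f)
        if not elf.has_dwarf_info():
            return
        for cu in elf.get_dwarf_info().iter_CUs():
            for die in cu.iter_DIEs():
                if die.tag != "DW_TAG_subprogram":
                    continue
                attrs = die.attributes
                if not {"DW_AT_name", "DW_AT_low_pc", "DW_AT_high_pc"} <= attrs.keys():
                    continue
                start = attrs["DW_AT_low_pc"].value
                high = attrs["DW_AT_high_pc"]
                # In DWARF 4+, DW_AT_high_pc is usually an offset from low_pc,
                # not an absolute address.
                end = high.value if high.form == "DW_FORM_addr" else start + high.value
                yield attrs["DW_AT_name"].value.decode(), start, end

if __name__ == "__main__":
    for name, start, end in function_boundaries("example_binary"):
        print(f"{name}: {start:#x}-{end:#x}")
```

The same DWARF traversal could be adapted to collect other labels mentioned in the abstract, such as argument types from DW_TAG_formal_parameter entries.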
Pages: 795-810
Number of pages: 16