On the Role of Pre-trained Embeddings in Binary Code Analysis

Cited by: 0
Authors
Maier, Alwin [1 ]
Weissberg, Felix [2 ]
Rieck, Konrad [2 ,3 ]
Affiliations
[1] Max Planck Inst Solar Syst Res, Göttingen, Germany
[2] Tech Univ Berlin, Berlin, Germany
[3] BIFOLD, Berlin, Germany
Funding
European Research Council;
Keywords
Transfer learning; Binary code analysis;
DOI
10.1145/3634737.3657029
Chinese Library Classification
TP [Automation technology, computer technology];
Subject Classification Code
0812;
Abstract
Deep learning has enabled remarkable progress in binary code analysis. In particular, pre-trained embeddings of assembly code have become a gold standard for solving analysis tasks, such as measuring code similarity or recognizing functions. These embeddings are capable of learning a vector representation from unlabeled code. In contrast to natural language processing, however, label information is not scarce for many tasks in binary code analysis. For example, labeled training data for function boundaries, optimization levels, and argument types can be easily derived from debug information provided by a compiler. Consequently, the main motivation of embeddings does not transfer directly to binary code analysis. In this paper, we explore the role of pre-trained embeddings from a critical perspective. To this end, we systematically evaluate recent embeddings for assembly code on five downstream tasks using a corpus of 1.2 million functions from the Debian distribution. We observe that several embeddings perform similarly when sufficient labeled data is available, and that differences reported in prior work are hardly noticeable. Surprisingly, we find that end-to-end learning without pre-training performs best on average, which calls into question the need for specialized embeddings. By varying the amount of labeled data, we eventually derive guidelines for when embeddings offer advantages and when end-to-end learning is preferable for binary code analysis.
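The abstract notes that labels such as function boundaries, optimization levels, and argument types can be derived from compiler-emitted debug information. The following is a minimal illustrative sketch, not taken from the paper, of how function-boundary labels might be read from DWARF data using pyelftools; the binary path "example_binary" is a placeholder and the binary is assumed to have been compiled with -g.

```python
# Illustrative sketch: derive function-boundary labels from DWARF debug info.
# Assumes pyelftools is installed and "example_binary" (hypothetical path)
# contains DW_TAG_subprogram entries emitted by the compiler.
from elftools.elf.elffile import ELFFile

def function_boundaries(path):
    """Yield (name, start, end) for each function described in the DWARF data."""
    with open(path, "rb") as f:
        elf = ELFFile(f)
        if not elf.has_dwarf_info():
            return
        for cu in elf.get_dwarf_info().iter_CUs():
            for die in cu.iter_DIEs():
                if die.tag != "DW_TAG_subprogram":
                    continue
                attrs = die.attributes
                if not {"DW_AT_name", "DW_AT_low_pc", "DW_AT_high_pc"} <= attrs.keys():
                    continue
                start = attrs["DW_AT_low_pc"].value
                high = attrs["DW_AT_high_pc"]
                # In DWARF 4+, DW_AT_high_pc is usually an offset from low_pc,
                # not an absolute address.
                end = high.value if high.form == "DW_FORM_addr" else start + high.value
                yield attrs["DW_AT_name"].value.decode(), start, end

if __name__ == "__main__":
    for name, start, end in function_boundaries("example_binary"):
        print(f"{name}: {start:#x}-{end:#x}")
```

The same DWARF traversal could be adapted to collect other labels mentioned in the abstract, such as argument types from DW_TAG_formal_parameter entries.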
Pages: 795-810
Number of pages: 16