On the Role of Pre-trained Embeddings in Binary Code Analysis

Cited by: 0
Authors
Maier, Alwin [1 ]
Weissberg, Felix [2 ]
Rieck, Konrad [2 ,3 ]
Affiliations
[1] Max Planck Inst Solar Syst Res, Göttingen, Germany
[2] Tech Univ Berlin, Berlin, Germany
[3] BIFOLD, Berlin, Germany
Funding
European Research Council;
Keywords
Transfer learning; Binary code analysis;
DOI
10.1145/3634737.3657029
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Deep learning has enabled remarkable progress in binary code analysis. In particular, pre-trained embeddings of assembly code have become a gold standard for solving analysis tasks, such as measuring code similarity or recognizing functions. These embeddings are capable of learning a vector representation from unlabeled code. In contrast to natural language processing, however, label information is not scarce for many tasks in binary code analysis. For example, labeled training data for function boundaries, optimization levels, and argument types can be easily derived from debug information provided by a compiler. Consequently, the main motivation of embeddings does not transfer directly to binary code analysis. In this paper, we explore the role of pre-trained embeddings from a critical perspective. To this end, we systematically evaluate recent embeddings for assembly code on five downstream tasks using a corpus of 1.2 million functions from the Debian distribution. We observe that several embeddings perform similarly when sufficient labeled data is available, and that differences reported in prior work are hardly noticeable. Surprisingly, we find that end-to-end learning without pre-training performs best on average, which calls into question the need for specialized embeddings. By varying the amount of labeled data, we eventually derive guidelines for when embeddings offer advantages and when end-to-end learning is preferable for binary code analysis.
Pages: 795-810
Number of pages: 16
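The abstract notes that labeled training data for tasks such as function-boundary recognition can be derived from compiler debug information. Below is a minimal sketch of this idea, not taken from the paper: it uses the pyelftools library to walk the DWARF debug information of an ELF binary and recover function start and end addresses. The binary path is a hypothetical placeholder.

```python
# Minimal sketch: derive function-boundary labels from DWARF debug info.
# Assumes pyelftools is installed; the input path is illustrative only.
from elftools.elf.elffile import ELFFile

def function_boundaries(path):
    """Yield (name, start, end) for each function described in the DWARF info."""
    with open(path, 'rb') as f:
        elf = ELFFile(f)
        if not elf.has_dwarf_info():
            return
        dwarf = elf.get_dwarf_info()
        for cu in dwarf.iter_CUs():
            for die in cu.iter_DIEs():
                if die.tag != 'DW_TAG_subprogram':
                    continue
                low = die.attributes.get('DW_AT_low_pc')
                high = die.attributes.get('DW_AT_high_pc')
                name = die.attributes.get('DW_AT_name')
                if low is None or high is None:
                    continue  # declarations and inlined instances lack a range
                start = low.value
                # DW_AT_high_pc is either an absolute address or an offset
                # from low_pc, depending on its form (DWARF 4+ uses offsets).
                end = high.value if high.form == 'DW_FORM_addr' else start + high.value
                yield (name.value.decode() if name else '<anonymous>', start, end)

# Hypothetical usage on a debug build:
for fname, start, end in function_boundaries('/bin/example'):
    print(f'{fname}: {start:#x}-{end:#x}')
```

Labels obtained this way can supervise a classifier directly, which is the kind of end-to-end setup the abstract contrasts with pre-trained embeddings.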
Related Papers
50 records in total
  • [31] An Empirical Comparison of Pre-Trained Models of Source Code
    Niu, Changan
    Li, Chuanyi
    Ng, Vincent
    Chen, Dongxiao
    Ge, Jidong
    Luo, Bin
    2023 IEEE/ACM 45TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, ICSE, 2023, : 2136 - 2148
  • [32] Leveraging pre-trained language models for code generation
    Soliman, Ahmed
    Shaheen, Samir
    Hadhoud, Mayada
    COMPLEX & INTELLIGENT SYSTEMS, 2024, 10 (03) : 3955 - 3980
  • [33] Diet Code Is Healthy: Simplifying Programs for Pre-trained Models of Code
    Zhang, Zhaowei
    Zhang, Hongyu
    Shen, Beijun
    Gu, Xiaodong
    PROCEEDINGS OF THE 30TH ACM JOINT MEETING EUROPEAN SOFTWARE ENGINEERING CONFERENCE AND SYMPOSIUM ON THE FOUNDATIONS OF SOFTWARE ENGINEERING, ESEC/FSE 2022, 2022, : 1073 - 1084
  • [34] Pre-Trained Semantic Embeddings for POI Categories Based on Multiple Contexts
    Bing, Junxiang
    Chen, Meng
    Yang, Min
    Huang, Weiming
    Gong, Yongshun
    Nie, Liqiang
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2023, 35 (09) : 8893 - 8904
  • [35] Nutrition Guided Recipe Search via Pre-trained Recipe Embeddings
    Li, Diya
    Zaki, Mohammed J.
    Chen, Ching-Hua
    2021 IEEE 37TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING WORKSHOPS (ICDEW 2021), 2021, : 20 - 23
  • [36] Classification of Respiration Sounds Using Deep Pre-trained Audio Embeddings
    Meza, Carlos A. Galindo
    del Hoyo Ontiveros, Juan A.
    Lopez-Meyer, Paulo
    2021 IEEE LATIN AMERICAN CONFERENCE ON COMPUTATIONAL INTELLIGENCE (LA-CCI), 2021.
  • [37] Improving Pre-trained Vision-and-Language Embeddings for Phrase Grounding
    Dou, Zi-Yi
    Peng, Nanyun
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 6362 - 6371
  • [38] Pre-trained Text Embeddings for Enhanced Text-to-Speech Synthesis
    Hayashi, Tomoki
    Watanabe, Shinji
    Toda, Tomoki
    Takeda, Kazuya
    Toshniwal, Shubham
    Livescu, Karen
    INTERSPEECH 2019, 2019, : 4430 - 4434
  • [39] Automated Employee Objective Matching Using Pre-trained Word Embeddings
    Ghanem, Mohab
    Elnaggar, Ahmed
    Mckinnon, Adam
    Debes, Christian
    Boisard, Olivier
    Matthes, Florian
    2021 IEEE 25TH INTERNATIONAL ENTERPRISE DISTRIBUTED OBJECT COMPUTING CONFERENCE (EDOC 2021), 2021, : 51 - 60
  • [40] An Empirical study on Pre-trained Embeddings and Language Models for Bot Detection
    Garcia-Silva, Andres
    Berrio, Cristian
    Gomez-Perez, Jose Manuel
    4TH WORKSHOP ON REPRESENTATION LEARNING FOR NLP (REPL4NLP-2019), 2019, : 148 - 155