On the Role of Pre-trained Embeddings in Binary Code Analysis

Cited by: 0
Authors
Maier, Alwin [1]
Weissberg, Felix [2]
Rieck, Konrad [2,3]
Affiliations
[1] Max Planck Institute for Solar System Research, Göttingen, Germany
[2] Technische Universität Berlin, Berlin, Germany
[3] BIFOLD, Berlin, Germany
Funding
European Research Council
Keywords
Transfer learning; Binary code analysis
DOI
10.1145/3634737.3657029
CLC classification
TP [Automation and computer technology]
Discipline classification code
0812
Abstract
Deep learning has enabled remarkable progress in binary code analysis. In particular, pre-trained embeddings of assembly code have become a gold standard for solving analysis tasks, such as measuring code similarity or recognizing functions. These embeddings are capable of learning a vector representation from unlabeled code. In contrast to natural language processing, however, label information is not scarce for many tasks in binary code analysis. For example, labeled training data for function boundaries, optimization levels, and argument types can be easily derived from debug information provided by a compiler. Consequently, the main motivation of embeddings does not transfer directly to binary code analysis. In this paper, we explore the role of pre-trained embeddings from a critical perspective. To this end, we systematically evaluate recent embeddings for assembly code on five downstream tasks using a corpus of 1.2 million functions from the Debian distribution. We observe that several embeddings perform similarly when sufficient labeled data is available, and that differences reported in prior work are hardly noticeable. Surprisingly, we find that end-to-end learning without pre-training performs best on average, which calls into question the need for specialized embeddings. By varying the amount of labeled data, we eventually derive guidelines for when embeddings offer advantages and when end-to-end learning is preferable for binary code analysis.
Pages: 795-810
Page count: 16
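
The abstract notes that labels such as function boundaries can be derived from the debug information a compiler emits. As a minimal, hypothetical sketch (not taken from the paper or its artifacts), the following Python snippet uses pyelftools to read DWARF DW_TAG_subprogram entries from an ELF binary built with -g and print function-boundary labels; the example input path is an assumption for illustration.

# Hedged sketch: derive function-boundary labels from DWARF debug information.
# Assumes pyelftools is installed and the binary was compiled with -g.
from elftools.elf.elffile import ELFFile

def function_boundaries(path):
    """Yield (name, start, end) for each function described in the DWARF data."""
    with open(path, "rb") as f:
        elf = ELFFile(f)
        if not elf.has_dwarf_info():
            return
        dwarf = elf.get_dwarf_info()
        for cu in dwarf.iter_CUs():
            for die in cu.iter_DIEs():
                if die.tag != "DW_TAG_subprogram":
                    continue
                low = die.attributes.get("DW_AT_low_pc")
                high = die.attributes.get("DW_AT_high_pc")
                if low is None or high is None:
                    continue  # declarations and inlined instances carry no range
                start = low.value
                # DW_AT_high_pc is either an absolute address (DW_FORM_addr)
                # or an offset from DW_AT_low_pc, depending on the DWARF form.
                end = high.value if high.form == "DW_FORM_addr" else start + high.value
                name = die.attributes.get("DW_AT_name")
                yield (name.value.decode() if name else "<anonymous>", start, end)

if __name__ == "__main__":
    # Hypothetical input binary; replace with any ELF file that carries DWARF info.
    for name, start, end in function_boundaries("/usr/bin/example"):
        print(f"{name}: 0x{start:x}-0x{end:x}")

Labels for the other tasks mentioned in the abstract could be gathered in a similar fashion, for example optimization levels from the DW_AT_producer string (when the compiler records its flags) and argument types from DW_TAG_formal_parameter children of each subprogram.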