DexBERT: Effective, Task-Agnostic and Fine-Grained Representation Learning of Android Bytecode

Cited by: 1
Authors
Sun T. [1 ]
Allix K. [1 ]
Kim K. [2 ]
Zhou X. [2 ]
Kim D. [3 ]
Lo D. [2 ]
Bissyande T.F. [1 ]
Klein J. [1 ]
Institutions
[1] University of Luxembourg, Kirchberg
[2] Singapore Management University
[3] Kyungpook National University, Daegu
Keywords
Android app analysis; code representation; defect prediction; malicious code localization; representation learning
DOI
10.1109/TSE.2023.3310874
Abstract
The automation of an increasingly large number of software engineering tasks is becoming possible thanks to Machine Learning (ML). One foundational building block in the application of ML to software artifacts is the representation of these artifacts (e.g., source code or executable code) in a form that is suitable for learning. Traditionally, researchers and practitioners have relied on manually selected features, based on expert knowledge, for the task at hand. Such knowledge is sometimes imprecise and generally incomplete. To overcome this limitation, many studies have leveraged representation learning, delegating to ML itself the job of automatically devising suitable representations and selecting the most relevant features. Yet, in the context of Android problems, existing models are either limited to coarse-grained, whole-app-level representations (e.g., apk2vec) or built for one specific downstream task (e.g., smali2vec). Thus, the produced representations may be unsuitable for fine-grained tasks or may not generalize beyond the task they were trained on. Our work is part of a new line of research that investigates effective, task-agnostic, and fine-grained universal representations of bytecode to mitigate both limitations. Such representations aim to capture information relevant to various low-level downstream tasks (e.g., at the class level). We are inspired by the field of Natural Language Processing, where the problem of universal representation was addressed by building Universal Language Models, such as BERT, whose goal is to capture abstract semantic information about sentences in a way that is reusable for a variety of tasks. We propose DexBERT, a BERT-like language model dedicated to representing chunks of DEX bytecode, the main binary format used in Android applications.
We empirically assess whether DexBERT is able to model the DEX language and evaluate the suitability of our model on three distinct class-level software engineering tasks: Malicious Code Localization, Defect Prediction, and Component Type Classification. We also experiment with strategies for handling apps of vastly different sizes, and we demonstrate one example of using our technique to investigate what information is relevant to a given task. © 1976-2012 IEEE.
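A "BERT-like language model" over bytecode is pre-trained with masked-token prediction on tokenized instruction streams. The sketch below is an illustration only, not the paper's actual pipeline: the whitespace tokenizer, the example Smali instructions, and the 15%/80/10/10 masking ratios are assumptions borrowed from the standard BERT recipe.

```python
import random

# Illustrative sketch of masked-language-model (MLM) input preparation
# over Smali instructions (the textual form of DEX bytecode). DexBERT's
# real tokenizer and masking setup may differ; see the paper for details.

SPECIAL = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"]

def tokenize_smali(instructions):
    """Naive whitespace tokenization of Smali instruction strings."""
    tokens = []
    for ins in instructions:
        tokens.extend(ins.split())
    return tokens

def build_vocab(token_lists):
    """Map special tokens, then corpus tokens, to integer ids."""
    vocab = {tok: i for i, tok in enumerate(SPECIAL)}
    for toks in token_lists:
        for t in toks:
            vocab.setdefault(t, len(vocab))
    return vocab

def mask_for_mlm(token_ids, vocab_size, mask_id, rng, mask_prob=0.15):
    """BERT-style masking: of the selected positions, 80% become [MASK],
    10% a random token, 10% stay unchanged. Labels are the original ids
    at selected positions and -100 (ignored by the loss) elsewhere."""
    inp, labels = list(token_ids), [-100] * len(token_ids)
    for i, tid in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tid
            r = rng.random()
            if r < 0.8:
                inp[i] = mask_id
            elif r < 0.9:
                inp[i] = rng.randrange(len(SPECIAL), vocab_size)
    return inp, labels

# Hypothetical Smali snippet (a common privacy-sensitive API call).
smali = [
    "invoke-virtual {v0}, Landroid/telephony/TelephonyManager;"
    "->getDeviceId()Ljava/lang/String;",
    "move-result-object v1",
]
toks = tokenize_smali(smali)
vocab = build_vocab([toks])
ids = [vocab[t] for t in toks]
rng = random.Random(0)
inp, labels = mask_for_mlm(ids, len(vocab), vocab["[MASK]"], rng)
```

A model trained on such (input, label) pairs learns to reconstruct masked instruction tokens, which is what allows the resulting embeddings to be reused across downstream class-level tasks.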
Pages: 4691-4706
Page count: 15