DexBERT: Effective, Task-Agnostic and Fine-Grained Representation Learning of Android Bytecode

Cited by: 1
Authors
Sun T. [1]
Allix K. [1]
Kim K. [2]
Zhou X. [2]
Kim D. [3]
Lo D. [2]
Bissyandé T.F. [1]
Klein J. [1]
Affiliations
[1] University of Luxembourg, Kirchberg
[2] Singapore Management University
[3] Kyungpook National University, Daegu
Keywords
Android app analysis; code representation; defect prediction; malicious code localization; representation learning
DOI: 10.1109/TSE.2023.3310874
Abstract
The automation of an increasingly large number of software engineering tasks is becoming possible thanks to Machine Learning (ML). One foundational building block in the application of ML to software artifacts is the representation of these artifacts (e.g., source code or executable code) in a form that is suitable for learning. Traditionally, researchers and practitioners have relied on manually selected features, based on expert knowledge, for the task at hand. Such knowledge is sometimes imprecise and generally incomplete. To overcome this limitation, many studies have leveraged representation learning, delegating to ML itself the job of automatically devising suitable representations and selecting the most relevant features. Yet, in the context of Android analysis, existing models are either limited to coarse-grained, whole-app representations (e.g., apk2vec) or trained for one specific downstream task (e.g., smali2vec). The resulting representations may thus be unsuitable for fine-grained tasks or may not generalize beyond the task they were trained on. Our work is part of a new line of research that investigates effective, task-agnostic, and fine-grained universal representations of bytecode to mitigate both limitations. Such representations aim to capture information relevant to various low-level downstream tasks (e.g., at the class level). We are inspired by the field of Natural Language Processing, where the problem of universal representation was addressed by building universal language models, such as BERT, whose goal is to capture abstract semantic information about sentences in a way that is reusable for a variety of tasks. We propose DexBERT, a BERT-like language model dedicated to representing chunks of DEX bytecode, the main binary format used in Android applications. We empirically assess whether DexBERT is able to model the DEX language, and we evaluate its suitability in three distinct class-level software engineering tasks: Malicious Code Localization, Defect Prediction, and Component Type Classification. We also experiment with strategies for handling apps of vastly different sizes, and we demonstrate one example of using our technique to investigate which information is relevant to a given task.
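To make the pipeline described in the abstract concrete, the sketch below shows how a BERT-style encoder could turn chunks of disassembled DEX (smali) code into a single class-level embedding for a downstream task such as defect prediction. It is a minimal illustration under stated assumptions, not the authors' implementation: the `bert-base-uncased` checkpoint stands in for DexBERT's own pre-training on DEX bytecode, and the `class_embedding` helper, the example smali strings, and the mean-pooling aggregation are hypothetical choices.

```python
# Hypothetical sketch: encode smali chunks with a BERT-style model and
# aggregate them into one class-level vector (not the authors' released code).
import torch
from transformers import BertModel, BertTokenizerFast

# Stand-in checkpoint; DexBERT itself is pre-trained on DEX bytecode, not English text.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
encoder.eval()

def class_embedding(smali_chunks):
    """Encode each bytecode chunk of a class and mean-pool the [CLS] vectors."""
    inputs = tokenizer(
        smali_chunks, padding=True, truncation=True,
        max_length=512, return_tensors="pt",
    )
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state   # (num_chunks, tokens, hidden)
    cls_vectors = hidden[:, 0, :]                      # one [CLS] vector per chunk
    return cls_vectors.mean(dim=0)                     # aggregate to the class level

# Class-level downstream head, e.g. binary defect prediction.
defect_head = torch.nn.Linear(encoder.config.hidden_size, 2)

chunks = [  # hypothetical smali instructions from one class
    "invoke-virtual {v0}, Landroid/telephony/TelephonyManager;->getDeviceId()Ljava/lang/String;",
    'const-string v1, "secret_key"',
]
logits = defect_head(class_embedding(chunks))
```

Mean-pooling the per-chunk [CLS] vectors is only one simple way to cope with classes and apps of vastly different sizes; the paper itself experiments with dedicated strategies for this size problem.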
Pages: 4691 - 4706
Number of pages: 15
Related papers
50 items in total
  • [21] Pivotal Role of Language Modeling in Recommender Systems: Enriching Task-specific and Task-agnostic Representation Learning
    Shin, Kyuyong
    Kwak, Hanock
    Kim, Wonjae
    Jeong, Jisu
    Jung, Seungjae
    Kim, Kyung-Min
    Ha, Jung-Woo
    Lee, Sang-Woo
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 1146 - 1161
  • [22] How Effective is Task-Agnostic Data Augmentation for Pretrained Transformers?
    Longpre, Shayne
    Wang, Yu
    DuBois, Christopher
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2020, 2020, : 4401 - 4411
  • [23] TRIO: Task-agnostic dataset representation optimized for automatic algorithm selection
    Cohen-Shapira, Noy
    Rokach, Lior
    2021 21ST IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM 2021), 2021, : 81 - 90
  • [24] Discrimination-Aware Mechanism for Fine-grained Representation Learning
    Xu, Furong
    Wang, Meng
    Zhang, Wei
    Cheng, Yuan
    Chu, Wei
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 813 - 822
  • [25] ADAPTIVE MULTI-TASK LEARNING FOR FINE-GRAINED CATEGORIZATION
    Sun, Gang
    Chen, Yanyun
    Liu, Xuehui
    Wu, Enhua
    2015 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2015, : 996 - 1000
  • [26] Fine-Grained Fashion Representation Learning by Online Deep Clustering
    Jiao, Yang
    Xie, Ning
    Gao, Yan
    Wang, Chien-Chih
    Sun, Yi
    COMPUTER VISION - ECCV 2022, PT XXVII, 2022, 13687 : 19 - 35
  • [27] Learning Deep Bilinear Transformation for Fine-grained Image Representation
    Zheng, Heliang
    Fu, Jianlong
    Zha, Zheng-Jun
    Luo, Jiebo
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
  • [28] FgDetector: Fine-grained Android Malware Detection
    Li, Dongfang
    Wang, Zhaoguo
    Li, Lixin
    Wang, Zhihua
    Wang, Yucheng
    Xue, Yibo
    2017 IEEE SECOND INTERNATIONAL CONFERENCE ON DATA SCIENCE IN CYBERSPACE (DSC), 2017, : 311 - 318
  • [29] Task-Agnostic Vision Transformer for Distributed Learning of Image Processing
    Kim, Boah
    Kim, Jeongsol
    Ye, Jong Chul
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2023, 32 : 203 - 218
  • [30] Interesting Object, Curious Agent: Learning Task-Agnostic Exploration
    Parisi, Simone
    Dean, Victoria
    Pathak, Deepak
    Gupta, Abhinav
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34