BHMDC: A byte and hex n-gram based malware detection and classification method

Cited by: 4
Authors
Tang, Yonghe [1]
Qi, Xuyan [1]
Jing, Jing [1]
Liu, Chunling [1]
Dong, Weiyu [1]
Affiliations
[1] State Key Lab Math Engn & Adv Comp, Zhengzhou 450000, Peoples R China
Keywords
Malware detection; Malware classification; Byte n-gram; Hex n-gram; Random forest; Light gradient boosting machine; MODEL;
DOI
10.1016/j.cose.2023.103118
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
In recent years, malware and its variants have proliferated, posing a grave threat to the security of systems and networks, so it is urgent to detect and classify malware in time to prevent the spread of malicious activity. However, existing malware detection and classification methods cannot fully meet the requirements of practical applications. Among them, machine learning-based approaches generally struggle to balance efficiency and accuracy because of imperfect feature representation, while deep learning-based methods are usually computationally intensive to train and deploy. To address this problem, we focus on improving feature extraction and the classification model, and propose a Byte and Hex n-gram based Malware Detection and Classification method called BHMDC. For malware detection, LightGBM is used to detect malware with just 256-dimensional byte unigram features, which achieves an accuracy of more than 99.70% on two built datasets with less time consumption. For malware classification, block byte unigram and hex n-gram features are proposed and combined, which preserves more properties and profiles executable files in a multi-granular way; random forest is then used to optimize the feature by removing redundant information and reducing the dimensionality, and LightGBM is finally used to identify malware families. The performance of the proposed approach is evaluated through experiments and compared with state-of-the-art methods. The proposed approach achieves 99.264% accuracy on the Microsoft malware classification challenge dataset and 99.775% accuracy on the Malimg dataset, which substantially outperforms the other approaches. These promising experimental results indicate that BHMDC can be used in antivirus software to detect malware variants and help security analysts identify malware families. (c) 2023 Published by Elsevier Ltd.
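The feature types named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact feature definitions, so the code below makes assumptions: `byte_unigram` is taken to be a normalized 256-bin histogram of byte values, and `hex_ngrams` is taken to count overlapping n-grams of two-character hex tokens. Both function names and the sample bytes are hypothetical and are not from the paper.

```python
from collections import Counter
from typing import List


def byte_unigram(data: bytes, normalize: bool = True) -> List[float]:
    """Return a 256-dimensional histogram of byte values (assumed definition)."""
    counts = [0] * 256
    for b in data:  # iterating over bytes yields integers in 0..255
        counts[b] += 1
    if normalize and data:
        total = len(data)
        return [c / total for c in counts]
    return [float(c) for c in counts]


def hex_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count overlapping n-grams of hex tokens (assumed definition)."""
    tokens = [f"{b:02X}" for b in data]
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


# Hypothetical sample: six bytes, not a real executable.
sample = bytes([0x4D, 0x5A, 0x90, 0x00, 0x4D, 0x5A])
features = byte_unigram(sample)
bigrams = hex_ngrams(sample, n=2)
```

A vector like `features` is the kind of fixed-length input that a gradient boosting classifier such as LightGBM accepts; the block byte unigram variant, the random forest feature selection, and the training steps described in the abstract are not reproduced here.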
Pages: 13