Using NLP techniques for file fragment classification

Cited by: 55
Authors
Fitzgerald, Simran [1]
Mathews, George [1]
Morris, Colin [1]
Zhulyn, Oles [1]
Affiliations
[1] Univ Toronto, Dept Comp Sci, Toronto, ON M5S 1A1, Canada
Keywords
File fragment classification; File carving; Natural language processing; Bigrams; Machine learning; Support vector machine; Digital forensics
DOI
10.1016/j.diin.2012.05.008
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The classification of file fragments is an important problem in digital forensics. The literature does not include comprehensive work on applying machine learning techniques to this problem. In this work, we explore the use of techniques from natural language processing to classify file fragments. We take a supervised learning approach, based on the use of support vector machines combined with the bag-of-words model, where text documents are represented as unordered bags of words. This technique has been repeatedly shown to be effective and robust in classifying text documents (e.g., in distinguishing positive movie reviews from negative ones). In our approach, we represent file fragments as "bags of bytes" with feature vectors consisting of unigram and bigram counts, as well as other statistical measurements such as entropy. We made use of the publicly available Garfinkel data corpus to generate file fragments for training and testing. We ran a series of experiments and found that this approach is also effective in this domain. © 2012 O. Zhulyn, S. Fitzgerald & G. Mathews. Published by Elsevier Ltd. All rights reserved.
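The "bag of bytes" representation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `byte_features` and `to_vector` are hypothetical, and the vector layout (256 unigram slots, 65,536 bigram slots, one entropy slot) is an assumption chosen for clarity. The abstract does not state the fragment size, any normalization, or which other statistical measurements were used, so none of those are modeled here.

```python
import math
from collections import Counter


def byte_features(fragment: bytes) -> dict:
    """Compute unigram counts, bigram counts, and Shannon entropy (in bits)
    for one file fragment. Hypothetical helper, not from the paper."""
    unigrams = Counter(fragment)                    # keys are byte values 0..255
    bigrams = Counter(zip(fragment, fragment[1:]))  # keys are (byte, next_byte) pairs
    n = len(fragment)
    entropy = 0.0
    for count in unigrams.values():
        p = count / n
        entropy -= p * math.log2(p)
    return {"unigrams": unigrams, "bigrams": bigrams, "entropy": entropy}


def to_vector(features: dict) -> list:
    """Flatten the features into a fixed-length list:
    256 unigram counts, then 65536 bigram counts, then entropy.
    This layout is an assumption made for illustration."""
    vec = [0.0] * (256 + 256 * 256 + 1)
    for byte, count in features["unigrams"].items():
        vec[byte] = float(count)
    for (first, second), count in features["bigrams"].items():
        vec[256 + first * 256 + second] = float(count)
    vec[-1] = features["entropy"]
    return vec


features = byte_features(b"abab")
vector = to_vector(features)
```

Vectors of this form, one per fragment and labeled with the source file type, could then be passed to any support vector machine implementation (for example, `sklearn.svm.SVC` in scikit-learn) for training and testing. The kernel and parameters used in the paper are not given in the abstract.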
Pages: S44-S49
Number of pages: 6
Related papers
50 in total
  • [31] Selecting NLP Classification Techniques to Better Understand Causes of Mass Killings
    Sticha, Abigail
    Brenner, Paul
    INTELLIGENT COMPUTING, VOL 2, 2022, 507 : 685 - 700
  • [32] FILE FRAGMENT ANALYSIS USING NORMALIZED COMPRESSION DISTANCE
    Axelsson, Stefan
    Bajwa, Kamran Ali
    Srikanth, Mandhapati Venkata
    ADVANCES IN DIGITAL FORENSICS IX, 2013, 410 : 171 - 182
  • [33] Sarcasm Analysis and Mood Retention Using NLP Techniques
    Majumdar, Srijita
    Datta, Debabrata
    Deyasi, Arpan
    Mukherjee, Soumen
    Bhattacharjee, Arup Kumar
    Acharya, Anal
    INTERNATIONAL JOURNAL OF INFORMATION RETRIEVAL RESEARCH, 2022, 12 (01)
  • [34] Phishing Email Detection Using Robust NLP Techniques
    Egozi, Gal
    Verma, Rakesh
    2018 18TH IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS (ICDMW), 2018, : 7 - 12
  • [35] Using NLP techniques for tagging events in Arabic text
    Abuleil, Saleem
    19TH IEEE INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE, VOL II, PROCEEDINGS, 2007, : 440 - 443
  • [36] A Framework for Efficient Information Retrieval Using NLP Techniques
    Subhashini, R.
    Kumar, V. Jawahar Senthil
    COMPUTER NETWORKS AND INFORMATION TECHNOLOGIES, 2011, 142 : 391 - +
  • [37] Information Security: Machine Learning Experiments to Solve the File Fragment Classification Problem
    Wilgenbus, Erich
    Kruger, Hennie
    du Toit, Tiny
    PROCEEDINGS OF THE 10TH INTERNATIONAL CONFERENCE ON CYBER WARFARE AND SECURITY (ICCWS-2015), 2015, : 390 - 398
  • [38] Security Requirements Classification into Groups Using NLP Transformers
    Varenov, Vasily
    Gabdrahmanov, Aydar
    29TH IEEE INTERNATIONAL REQUIREMENTS ENGINEERING CONFERENCE WORKSHOPS (REW 2021), 2021, : 444 - 450
  • [39] Explainable APT Attribution for Malware Using NLP Techniques
    Wang, Qinqin
    Yan, Hanbing
    Han, Zhihui
    2021 IEEE 21ST INTERNATIONAL CONFERENCE ON SOFTWARE QUALITY, RELIABILITY AND SECURITY (QRS 2021), 2021, : 70 - 80
  • [40] NLP TECHNIQUES FOR SALESPEOPLE
    CONNELL, HS
    TRAINING AND DEVELOPMENT JOURNAL, 1984, 38 (11): 44 - 46