Using NLP techniques for file fragment classification

Cited by: 55
Authors
Fitzgerald, Simran [1]
Mathews, George [1]
Morris, Colin [1]
Zhulyn, Oles [1]
Affiliations
[1] Univ Toronto, Dept Comp Sci, Toronto, ON M5S 1A1, Canada
Keywords
File fragment classification; File carving; Natural language processing; Bigrams; Machine learning; Support vector machine; Digital forensics
DOI
10.1016/j.diin.2012.05.008
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The classification of file fragments is an important problem in digital forensics. The literature does not include comprehensive work on applying machine learning techniques to this problem. In this work, we explore the use of techniques from natural language processing to classify file fragments. We take a supervised learning approach, based on the use of support vector machines combined with the bag-of-words model, where text documents are represented as unordered bags of words. This technique has been repeatedly shown to be effective and robust in classifying text documents (e.g., in distinguishing positive movie reviews from negative ones). In our approach, we represent file fragments as "bags of bytes" with feature vectors consisting of unigram and bigram counts, as well as other statistical measurements such as entropy. We made use of the publicly available Garfinkel data corpus to generate file fragments for training and testing. We ran a series of experiments, and found that this approach is effective in this domain as well. © 2012 O. Zhulyn, S. Fitzgerald & G. Mathews. Published by Elsevier Ltd. All rights reserved.
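The "bag of bytes" representation described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the feature-key format (`u:` for unigrams, `b:` for bigrams), and the choice of overlapping bigrams are assumptions made here, and the abstract does not specify the fragment size, any normalization, or the full set of statistical measurements used in the paper.

```python
import math
from collections import Counter


def unigram_counts(fragment):
    """Return a 256-element list with the count of each byte value."""
    counts = [0] * 256
    for byte in fragment:
        counts[byte] += 1
    return counts


def bigram_counts(fragment):
    """Count overlapping adjacent byte pairs, keyed by (first, second)."""
    return Counter(zip(fragment, fragment[1:]))


def shannon_entropy(fragment):
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    n = len(fragment)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in unigram_counts(fragment) if c)


def fragment_features(fragment):
    """Build a sparse feature dict: unigram counts, bigram counts, entropy.

    The key naming scheme is hypothetical and chosen only for readability.
    """
    features = {}
    for value, count in enumerate(unigram_counts(fragment)):
        if count:
            features[f"u:{value:02x}"] = count
    for (first, second), count in bigram_counts(fragment).items():
        features[f"b:{first:02x}{second:02x}"] = count
    features["entropy"] = shannon_entropy(fragment)
    return features


if __name__ == "__main__":
    # Tiny stand-in for a file fragment; real fragments would be much larger.
    print(fragment_features(b"abab"))
```

Sparse dictionaries of this kind could then be vectorized and passed to a support vector machine, for example with scikit-learn's `DictVectorizer` and `LinearSVC`. That pairing is one plausible setup and is not stated in the abstract.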
Pages: S44-S49
Page count: 6
Related papers
50 in total
  • [1] Dataset for file fragment classification of audio file formats
    Atieh Khodadadi
    Mehdi Teimouri
    BMC Research Notes, 12
  • [2] Dataset for file fragment classification of textual file formats
    Fatemeh Mansouri Hanis
    Mehdi Teimouri
    BMC Research Notes, 12
  • [3] Dataset for file fragment classification of audio file formats
    Fakouri, Reyhane
    Teimouri, Mehdi
    BMC RESEARCH NOTES, 2019, 12 (01)
  • [4] Dataset for file fragment classification of image file formats
    Fakouri, Reyhane
    Teimouri, Mehdi
    BMC RESEARCH NOTES, 2019, 12 (01)
  • [5] Dataset for file fragment classification of image file formats
    Reyhane Fakouri
    Mehdi Teimouri
    BMC Research Notes, 12
  • [6] Dataset for file fragment classification of textual file formats
    Mansouri Hanis, Fatemeh
    Teimouri, Mehdi
    BMC RESEARCH NOTES, 2019, 12 (01)
  • [7] Dataset for file fragment classification of video file formats
    Sadeghi, Narges
    Fahiminia, Mohadeseh
    Teimouri, Mehdi
    BMC RESEARCH NOTES, 2020, 13 (01)
  • [8] Dataset for file fragment classification of video file formats
    Narges Sadeghi
    Mohadeseh Fahiminia
    Mehdi Teimouri
    BMC Research Notes, 13
  • [9] Classification of "Hot News" for Financial Forecast Using NLP Techniques
    Yildirim, Savas
    Jothimani, Dhanya
    Kavaklioglu, Can
    Basar, Ayse
    2018 IEEE International Conference on Big Data (Big Data), 2018: 4719-4722
  • [10] Byte embeddings for file fragment classification
    Haque, Md Enamul
    Tozal, Mehmet Engin
    Future Generation Computer Systems: The International Journal of eScience, 2022, 127: 448-461