The impact of preprocessing on text classification

Cited by: 369
Authors
Uysal, Alper Kursat [1]
Gunal, Serkan [1]
Affiliations
[1] Anadolu Univ, Dept Comp Engn, Eskisehir, Turkey
Keywords
Pattern recognition; Text categorization; Text classification; Text preprocessing; FEATURE-SELECTION; ALGORITHM; MODEL
DOI
10.1016/j.ipm.2013.08.006
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Subject classification code
0812
Abstract
Preprocessing is one of the key components in a typical text classification framework. This paper aims to extensively examine the impact of preprocessing on text classification in terms of various aspects such as classification accuracy, text domain, text language, and dimension reduction. For this purpose, all possible combinations of widely used preprocessing tasks are comparatively evaluated on two different domains, namely e-mail and news, and in two different languages, namely Turkish and English. In this way, the contribution of the preprocessing tasks to classification success at various feature dimensions, possible interactions among these tasks, and the dependence of these tasks on the respective languages and domains are comprehensively assessed. Experimental analysis on benchmark datasets reveals that choosing appropriate combinations of preprocessing tasks, rather than enabling or disabling them all, may provide significant improvement in classification accuracy, depending on the domain and language studied. (C) 2013 Elsevier Ltd. All rights reserved.
Pages: 104-112
Number of pages: 9
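
The abstract's idea of evaluating "all possible combinations" of preprocessing tasks can be illustrated with a minimal sketch. The abstract does not name the individual tasks, so the three toggles used here (`lowercase`, `stop_words`, `stemming`), the tiny stop-word set, and the crude plural stripper standing in for a stemmer are all hypothetical placeholders, not the authors' setup. The sketch only enumerates the on/off combinations and applies each to a sample string; it does not train or score a classifier.

```python
from itertools import product

# Hypothetical stop-word set; the record does not specify one.
STOP_WORDS = {"the", "a", "of", "is"}


def lowercase(tokens):
    """Convert every token to lowercase."""
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in STOP_WORDS."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def strip_plural(tokens):
    """Crude stand-in for stemming: drop a trailing 's' from longer tokens."""
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


# Hypothetical task list, applied in this fixed order when enabled.
TASKS = [
    ("lowercase", lowercase),
    ("stop_words", remove_stop_words),
    ("stemming", strip_plural),
]


def all_combinations(tasks):
    """Yield every on/off assignment as a tuple of enabled task names."""
    for flags in product([False, True], repeat=len(tasks)):
        yield tuple(name for (name, _), on in zip(tasks, flags) if on)


def preprocess(text, enabled):
    """Whitespace-tokenize text, then apply the enabled tasks in order."""
    tokens = text.split()
    for name, fn in TASKS:
        if name in enabled:
            tokens = fn(tokens)
    return tokens


if __name__ == "__main__":
    sample = "The Impact of Preprocessing Tasks"
    for combo in all_combinations(TASKS):
        print(combo, preprocess(sample, combo))
```

With n binary tasks there are 2^n combinations, so three toggles give eight configurations. In an actual study, each configuration would feed a feature extraction and classification step whose accuracy is then compared across configurations.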