The impact of preprocessing on text classification

Cited by: 369
Authors
Uysal, Alper Kursat [1]
Gunal, Serkan [1]
Affiliation
[1] Anadolu Univ, Dept Comp Engn, Eskisehir, Turkey
Keywords
Pattern recognition; Text categorization; Text classification; Text preprocessing; Feature selection; Algorithm; Model
DOI
10.1016/j.ipm.2013.08.006
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Preprocessing is one of the key components of a typical text classification framework. This paper examines the impact of preprocessing on text classification in terms of classification accuracy, text domain, text language, and dimension reduction. To this end, all possible combinations of widely used preprocessing tasks are compared on two domains, e-mail and news, and in two languages, Turkish and English. This makes it possible to assess the contribution of the preprocessing tasks to classification success at various feature dimensions, the possible interactions among these tasks, and the dependence of these tasks on the respective language and domain. Experimental analysis on benchmark datasets reveals that choosing an appropriate combination of preprocessing tasks, rather than enabling or disabling them all, may significantly improve classification accuracy, depending on the domain and language studied. © 2013 Elsevier Ltd. All rights reserved.
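The idea of evaluating every on/off combination of preprocessing tasks can be illustrated with a minimal sketch. The abstract does not list the tasks or describe the implementation, so the two switches below (`lowercase`, `remove_stopwords`), the toy stop-word list, and the helper names are all assumptions made for illustration. In a real experiment each resulting token list would feed a feature selection and classification stage, which is omitted here.

```python
from itertools import product

# Hypothetical task switches; the abstract does not name the tasks evaluated.
TASKS = ("lowercase", "remove_stopwords")
STOPWORDS = {"the", "a", "of"}  # toy stop-word list, for illustration only


def preprocess(text, lowercase, remove_stopwords):
    """Apply the enabled preprocessing steps to whitespace-split tokens."""
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if remove_stopwords:
        # Compare case-insensitively so the result does not depend on step order.
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return tokens


def all_combinations(text):
    """Map every on/off setting of TASKS to the resulting token list."""
    results = {}
    for flags in product((False, True), repeat=len(TASKS)):
        settings = dict(zip(TASKS, flags))
        results[flags] = preprocess(text, **settings)
    return results


if __name__ == "__main__":
    for flags, tokens in all_combinations("The Price of a Stock").items():
        print(dict(zip(TASKS, flags)), tokens)
```

With n binary tasks this enumerates 2^n configurations, so adding further switches (for example a stemming step) doubles the number of runs each time.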
Pages: 104-112
Page count: 9
Related papers
50 in total
  • [21] Comparison of text preprocessing methods
    Chai, Christine P.
    NATURAL LANGUAGE ENGINEERING, 2023, 29 (03) : 509 - 553
  • [22] The impact of OCR accuracy on automatic text classification
    Zu, GW
    Murata, M
    Ohyama, W
    Wakabayashi, T
    Kimura, F
    CONTENT COMPUTING, PROCEEDINGS, 2004, 3309 : 403 - 409
  • [23] The impact of indexing approaches on Arabic text classification
    Al-Badarneh, Amer
    Al-Shawakfa, Emad
    Bani-Ismail, Basel
    Al-Rababah, Khaleel
    Shatnawi, Safwan
    JOURNAL OF INFORMATION SCIENCE, 2017, 43 (02) : 159 - 173
  • [24] Text Mining in Hotel Reviews: Impact of Words Restriction in Text Classification
    Campos, Diogo
    Silva, Rodrigo Rocha
    Bernardino, Jorge
    KDIR: PROCEEDINGS OF THE 11TH INTERNATIONAL JOINT CONFERENCE ON KNOWLEDGE DISCOVERY, KNOWLEDGE ENGINEERING AND KNOWLEDGE MANAGEMENT - VOL 1: KDIR, 2019 : 442 - 449
  • [25] Universal text preprocessing for data compression
    Abel, J
    Teahan, W
    IEEE TRANSACTIONS ON COMPUTERS, 2005, 54 (05) : 497 - 507
  • [26] Preprocessing Arabic text on social media
    Hegazi, Mohamed Osman
    Al-Dossari, Yasser
    Al-Yahy, Abdullah
    Al-Sumari, Abdulaziz
    Hilal, Anwer
    HELIYON, 2021, 7 (02)
  • [27] Text preprocessing for Czech speech synthesis
    Batusek, R
    Dvorák, J
    TEXT, SPEECH AND DIALOGUE, 1999, 1692 : 209 - 214
  • [28] STRING MATCHING WITH PREPROCESSING OF TEXT AND PATTERN
    NAOR, M
    LECTURE NOTES IN COMPUTER SCIENCE, 1991, 510 : 739 - 750
  • [29] The Influence of preprocessing parameters on text categorization
    Pomikalek, Jan
    Rehurek, Radim
    PROCEEDINGS OF WORLD ACADEMY OF SCIENCE, ENGINEERING AND TECHNOLOGY, VOL 19, 2007, 19 : 430 - 433
  • [30] DATA PREPROCESSING IN WEB TEXT MINING
    Jiang Yongbo
    FIFTH INTERNATIONAL CONFERENCE ON ADVANCED COMPUTER THEORY AND ENGINEERING (ICACTE 2012), 2012 : 573 - 581