Albanian Text Classification: Bag of Words Model and Word Analogies

Cited by: 9
Authors
Kadriu, Arbana [1]
Abazi, Lejla [2]
Abazi, Hyrije [2]
Affiliations
[1] SEE Univ, Fac Contemporary Sci & Technol, Tetovo, North Macedonia
[2] SEE Univ, Tetovo, North Macedonia
Source
BUSINESS SYSTEMS RESEARCH JOURNAL | 2019, Vol. 10, No. 1
Keywords
data mining; text classification; news articles; machine learning
DOI
10.2478/bsrj-2019-0006
Chinese Library Classification number
F [Economics]
Discipline classification code
02
Abstract
Background: Text classification is an important task in information retrieval. Its objective is to assign new text documents to a set of predefined classes using supervised algorithms. Objectives: We focus on text classification of Albanian news articles using two approaches. Methods/Approach: In the first approach, the words in a collection are treated as independent components, each assigned a corresponding vector in the vector space. Here we used nine classifiers from the scikit-learn package, training them on 80% of the news articles and testing accuracy on the remaining articles. In the second approach, words are treated according to their semantic and syntactic similarities, assuming that a word is formed by character n-grams. In this case, we used fastText, a hierarchical classifier that considers local word order as well as sub-word information. We measured the accuracy of each classifier separately and also analyzed training and testing time. Results: Our results show that the bag of words model performs better than fastText when classification is tested on a dataset that is not large. fastText performs better when classifying multi-label text. Conclusions: News articles can serve as a benchmark for testing classification algorithms for Albanian texts. The best results are achieved with a bag of words model, with an accuracy of 94%.
Pages: 74-87
Number of pages: 14
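
The two text representations contrasted in the abstract can be illustrated with a minimal sketch. It uses only the Python standard library and is not the authors' pipeline, which relied on scikit-learn and fastText. The function names, the toy documents, the choice of n = 3, the shuffled split, and the `<` / `>` word-boundary markers (the convention described for fastText's sub-word model) are assumptions made for illustration.

```python
import random
from collections import Counter


def bag_of_words(docs):
    """Map each document to a count vector over a shared, sorted vocabulary.

    Each word is an independent dimension; word order is discarded.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def char_ngrams(word, n=3):
    """Return the character n-grams of a word wrapped in boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def split_80_20(items, seed=0):
    """Shuffle a copy of the items and split it 80% / 20% (assumed scheme)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    # Toy documents, invented for illustration only.
    vocab, vectors = bag_of_words(["goal match goal", "vote election"])
    print(vocab)    # ['election', 'goal', 'match', 'vote']
    print(vectors)  # [[0, 2, 1, 0], [1, 0, 0, 1]]
    print(char_ngrams("lajme"))  # ['<la', 'laj', 'ajm', 'jme', 'me>']
    train, test = split_80_20(range(10))
    print(len(train), len(test))  # 8 2
```

In the bag of words view, two words that share a stem still occupy unrelated dimensions, whereas the character n-gram view lets them share sub-word units. This is one plausible reason a sub-word model is of interest for a morphologically rich language such as Albanian.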