Albanian Text Classification: Bag of Words Model and Word Analogies

Cited by: 9
Authors
Kadriu, Arbana [1]
Abazi, Lejla [2]
Abazi, Hyrije [2]
Affiliations
[1] SEE Univ, Fac Contemporary Sci & Technol, Tetovo, North Macedonia
[2] SEE Univ, Tetovo, North Macedonia
Source
BUSINESS SYSTEMS RESEARCH JOURNAL | 2019, Vol. 10, No. 1
Keywords
data mining; text classification; news articles; machine learning
DOI
10.2478/bsrj-2019-0006
Chinese Library Classification
F [Economics]
Discipline classification code
02
Abstract
Background: Text classification is an important task in information retrieval. Its objective is to assign new text documents to a set of predefined classes using supervised algorithms. Objectives: We focus on text classification of Albanian news articles using two approaches. Methods/Approach: In the first approach, the words in a collection are treated as independent components, each assigned a corresponding vector in the vector space. Here we used nine classifiers from the scikit-learn package, training them on part of the news articles (80%) and testing accuracy on the remaining articles. In the second approach, words are treated according to their semantic and syntactic similarities, assuming that a word is formed of character n-grams. In this case, we used fastText, a hierarchical classifier that considers local word order as well as sub-word information. We measured the accuracy of each classifier separately and also analyzed training and testing time. Results: Our results show that the bag of words model outperforms fastText when the text dataset is not large. FastText performs better when classifying multi-label text. Conclusions: News articles can serve as a benchmark for testing classification algorithms on Albanian texts. The best results are achieved with the bag of words model, with an accuracy of 94%.
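The two word representations contrasted in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the helper names (`build_vocabulary`, `bag_of_words`, `char_ngrams`) and the three toy sentences are hypothetical, and only the Python standard library is used instead of scikit-learn or fastText. The first part builds count vectors in which each word is an independent dimension; the second splits a word into character n-grams, with `<` and `>` assumed as word-boundary markers.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(document, vocabulary):
    """Return a count vector; word order is discarded, unknown words are skipped."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


def char_ngrams(word, n=3):
    """Character n-grams of a word padded with the boundary markers '<' and '>'."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


# Hypothetical toy documents, not taken from the paper's corpus.
docs = ["ekipi fitoi ndeshjen", "qeveria miratoi ligjin", "ekipi humbi ndeshjen"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(d, vocab) for d in docs]
```

In this sketch the first and third documents share two non-zero dimensions (`ekipi`, `ndeshjen`), which is the only similarity signal a bag of words model sees. The character n-grams of `ekipi` would additionally overlap with those of inflected forms of the same word, which is the kind of sub-word information the abstract attributes to fastText.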
Pages: 74-87
Number of pages: 14