Albanian Text Classification: Bag of Words Model and Word Analogies

Cited by: 9
Authors
Kadriu, Arbana [1]
Abazi, Lejla [2]
Abazi, Hyrije [2]
Affiliations
[1] SEE Univ, Fac Contemporary Sci & Technol, Tetovo, North Macedonia
[2] SEE Univ, Tetovo, North Macedonia
Source
Business Systems Research Journal | 2019 / Vol. 10 / Issue 01
Keywords
data mining; text classification; news articles; machine learning
DOI
10.2478/bsrj-2019-0006
Chinese Library Classification number
F [Economics]
Discipline classification code
02
Abstract
Background: Text classification is an important task in information retrieval. Its objective is to assign new text documents to a set of predefined classes using supervised algorithms. Objectives: We focus on text classification for Albanian news articles using two approaches. Methods/Approach: In the first approach, the words in a collection are treated as independent components, each assigned a corresponding vector in the vector space. Here we used nine classifiers from the scikit-learn package, training them on part of the news articles (80%) and testing accuracy on the remaining articles. In the second approach, text classification treats words according to their semantic and syntactic similarities, assuming that a word is formed by character n-grams. In this case, we used fastText, a hierarchical classifier that considers local word order as well as sub-word information. We measured the accuracy of each classifier separately and also analyzed training and testing time. Results: Our results show that the bag of words model outperforms fastText when classification is tested on a dataset that is not large. FastText performs better when classifying multi-label text. Conclusions: News articles can serve as a benchmark for testing classification algorithms on Albanian texts. The best results are achieved with a bag of words model, with an accuracy of 94%.
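The two text representations contrasted in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `bag_of_words` and `char_ngrams`, the two toy Albanian sentences, and the choice of n = 3 are assumptions made for illustration. The `<` and `>` padding follows the word-boundary convention fastText uses for sub-word n-grams.

```python
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Return a count vector for `tokens`, one slot per vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def char_ngrams(word, n):
    """Return the character n-grams of `word`, padded with '<' and '>'."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


# Toy documents (hypothetical, not taken from the paper's news corpus).
docs = [
    "ekipi fitoi ndeshjen".split(),
    "qeveria miratoi ligjin".split(),
]

# Bag of words: every distinct word is an independent dimension.
vocabulary = sorted({token for doc in docs for token in doc})
vectors = [bag_of_words(doc, vocabulary) for doc in docs]

# Sub-word view: a word is described by its character n-grams.
trigrams = char_ngrams("ekipi", 3)

print(vocabulary)
print(vectors)
print(trigrams)
```

In the bag of words view the two documents share no dimensions, because every word is an independent component. The n-gram view lets words with common character sequences share features, which is the sub-word information the abstract attributes to fastText.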
Pages: 74-87
Number of pages: 14