Albanian Text Classification: Bag of Words Model and Word Analogies

Cited by: 9
Authors
Kadriu, Arbana [1 ]
Abazi, Lejla [2 ]
Abazi, Hyrije [2 ]
Affiliations
[1] SEE Univ, Fac Contemporary Sci & Technol, Tetovo, North Macedonia
[2] SEE Univ, Tetovo, North Macedonia
Source
BUSINESS SYSTEMS RESEARCH JOURNAL | 2019, Vol. 10, No. 1
Keywords
data mining; text classification; news articles; machine learning
DOI
10.2478/bsrj-2019-0006
Chinese Library Classification number
F [Economics]
Discipline classification code
02
Abstract
Background: Text classification is an important task in information retrieval. Its objective is to assign new text documents to a set of predefined classes using supervised algorithms. Objectives: We focus on text classification of Albanian news articles using two approaches. Methods/Approach: In the first approach, the words in a collection are treated as independent components, and each is assigned a corresponding vector in the vector space. Here we used nine classifiers from the scikit-learn package, training them on 80% of the news articles and testing accuracy on the remaining articles. In the second approach, text classification treats words according to their semantic and syntactic similarities, assuming that each word is composed of character n-grams. In this case, we used fastText, a hierarchical classifier that considers local word order as well as sub-word information. We measured the accuracy of each classifier separately and also analyzed training and testing time. Results: Our results show that the bag of words model performs better than fastText when the classification is tested on a text dataset that is not large. FastText performs better when classifying multi-label text. Conclusions: News articles can serve as a benchmark for testing classification algorithms on Albanian texts. The best results are achieved with a bag of words model, with an accuracy of 94%.
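The two text representations contrasted in the abstract can be illustrated with a minimal sketch. It uses only the Python standard library rather than scikit-learn or fastText, and it is not the authors' code: the function names `bag_of_words` and `char_ngrams`, the toy sentences, and the n-gram range are assumptions chosen for illustration. The boundary markers `<` and `>` follow the convention commonly described for fastText sub-word units.

```python
from collections import Counter


def bag_of_words(docs):
    """Map each document to a vector of word counts over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]
    # A sorted vocabulary gives every word a fixed position in the vector.
    vocab = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Words absent from a document get a count of 0.
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


# Toy documents (hypothetical, not taken from the paper's dataset).
docs = ["ekipi fitoi ndeshjen", "qeveria miratoi ligjin", "ekipi humbi ndeshjen"]
vocab, vectors = bag_of_words(docs)
trigrams = char_ngrams("lajm", n_min=3, n_max=3)
```

In the first representation each word is an independent dimension, so word order is lost. In the second, words that share character sequences also share n-grams, which is one way sub-word information can carry over between related word forms.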
Pages: 74-87
Page count: 14
Related papers
50 in total
  • [31] Image Classification Using Bag Of Visual Words Model With FAST And FREAK
    Singhal, Neetika
    Singhal, Nishank
    Kalaichelvi, V.
    PROCEEDINGS OF THE 2017 IEEE SECOND INTERNATIONAL CONFERENCE ON ELECTRICAL, COMPUTER AND COMMUNICATION TECHNOLOGIES (ICECCT), 2017,
  • [32] A ⟨word, part of speech⟩ embedding model for text classification
    Liu, Wenfeng
    Liu, Peiyu
    Yang, Yuzhen
    Yi, Jing
    Zhu, Zhenfang
    EXPERT SYSTEMS, 2019, 36 (06)
  • [34] Contrastive Approach towards Text Source Classification based on Top-Bag-of-Word Similarity
    Huang, Chu-Ren
    Lee, Lung-Hao
    PACLIC 22: PROCEEDINGS OF THE 22ND PACIFIC ASIA CONFERENCE ON LANGUAGE, INFORMATION AND COMPUTATION, 2008, : 404 - +
  • [35] Comparison between Bag of Words and Word Sense Disambiguation
    Elyasir, Ayoub Mohamed H.
    Anbananthen, Kalaiarasi Sonai Muthu
    PROCEEDINGS OF THE 2013 INTERNATIONAL CONFERENCE ON ADVANCED COMPUTER SCIENCE AND ELECTRONICS INFORMATION (ICACSEI 2013), 2013, 41 : 413 - 417
  • [36] Spatial orientations of visual word pairs to improve Bag-of-Visual-Words model
    Khan, Rahat
    Barat, Cecile
    Muselet, Damien
    Ducottet, Christophe
    PROCEEDINGS OF THE BRITISH MACHINE VISION CONFERENCE 2012, 2012,
  • [37] Text classification by labeling words
    Liu, B
    Li, XL
    Lee, WS
    Yu, PS
    PROCEEDING OF THE NINETEENTH NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND THE SIXTEENTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2004, : 425 - 430
  • [38] Comparing High Dimensional Word Embeddings Trained on Medical Text to Bag-of-Words for Predicting Medical Codes
    Yogarajan, Vithya
    Gouk, Henry
    Smith, Tony
    Mayo, Michael
    Pfahringer, Bernhard
    INTELLIGENT INFORMATION AND DATABASE SYSTEMS (ACIIDS 2020), PT I, 2020, 12033 : 97 - 108
  • [39] Beyond the bag of words: A text representation for sentence selection
    Caropreso, Maria Fernanda
    Matwin, Stan
    ADVANCES IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2006, 4013 : 324 - 335
  • [40] BOWL: Bag of Word Clusters Text Representation Using Word Embeddings
    Rui, Weikang
    Xing, Kai
    Jia, Yawei
    KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, KSEM 2016, 2016, 9983 : 3 - 14