A variant of n-gram based language classification

Cited by: 0
Authors
Tomovic, Andrija [1]
Janicic, Predrag [2]
Affiliations
[1] Novartis Res Fdn, Friedrich Miescher Inst Biomed Res, Maulbeerstr 66, CH-4058 Basel, Switzerland
[2] Univ Belgrade, Fac Math, Belgrade, Serbia
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Rapid classification of documents is of high importance in many multilingual settings (such as international institutions or Internet search engines). This has been, for years, a well-known problem, addressed by different techniques, with excellent results. We address this problem by a simple n-gram-based technique, a variation of techniques of this family. Our n-gram-based classification is very robust and successful, even for 20-fold classification, and even for short text strings. We give a detailed study for different lengths of strings and sizes of n-grams, and we explore what classification parameters give the best performance. There is no requirement for vocabularies, but only for a few training documents. As a main corpus, we used an EU set of documents in 20 languages. Experimental comparison shows that our approach gives better results than four other popular approaches.
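The abstract does not spell out the classification procedure, so the following is only a minimal sketch of the general family it refers to (character n-gram profiles compared between a query string and a few training documents), not the authors' exact method. The function names `ngram_profile`, `similarity`, and `classify`, the overlap measure, the default `n=3`, and the toy training sentences are all assumptions made for illustration.

```python
from collections import Counter


def ngram_profile(text, n=3):
    """Return relative frequencies of character n-grams in text."""
    text = " ".join(text.lower().split())  # normalise case and whitespace
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def similarity(p, q):
    """Overlap of two profiles: sum of the smaller frequency of each shared n-gram."""
    return sum(min(freq, q[gram]) for gram, freq in p.items() if gram in q)


def classify(text, profiles, n=3):
    """Return the label whose training profile overlaps most with text."""
    query = ngram_profile(text, n)
    return max(profiles, key=lambda label: similarity(query, profiles[label]))


# Toy training data, invented for illustration only (not the EU corpus).
training = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann schläft der hund in der sonne",
    "fr": "le renard brun rapide saute par-dessus le chien paresseux et puis le chien dort au soleil",
}
profiles = {label: ngram_profile(sample) for label, sample in training.items()}

print(classify("the dog and the fox", profiles))
print(classify("der hund und der fuchs", profiles))
```

No vocabulary is needed in this sketch, only one short training text per language, which mirrors the requirement stated in the abstract. Varying `n` and the length of the query string is the kind of parameter study the abstract describes.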
Pages: 410 / +
Page count: 3
Related papers
50 in total
  • [21] A Short Text Classification Method Based on N-Gram and CNN
    Wang, Haitao
    He, Jie
    Zhang, Xiaohong
    Liu, Shufen
    CHINESE JOURNAL OF ELECTRONICS, 2020, 29 (02) : 248 - 254
  • [22] A New Estimate of the n-gram Language Model
    Aouragh, Si Lhoussain
    Yousfi, Abdellah
    Laaroussi, Saida
    Gueddah, Hicham
    Nejja, Mohammed
    AI IN COMPUTATIONAL LINGUISTICS, 2021, 189 : 211 - 215
  • [23] MIXTURE OF MIXTURE N-GRAM LANGUAGE MODELS
    Sak, Hasim
    Allauzen, Cyril
    Nakajima, Kaisuke
    Beaufays, Francoise
    2013 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING (ASRU), 2013, : 31 - 36
  • [24] Perplexity of n-Gram and Dependency Language Models
    Popel, Martin
    Marecek, David
    TEXT, SPEECH AND DIALOGUE, 2010, 6231 : 173 - 180
  • [25] Development of the N-gram Model for Azerbaijani Language
    Bannayeva, Aliya
    Aslanov, Mustafa
    2020 IEEE 14TH INTERNATIONAL CONFERENCE ON APPLICATION OF INFORMATION AND COMMUNICATION TECHNOLOGIES (AICT2020), 2020,
  • [26] Discriminative N-gram Language Modeling for Turkish
    Arisoy, Ebru
    Roark, Brian
    Shafran, Izhak
    Saraclar, Murat
    INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, 2008, : 825 - +
  • [27] Are n-gram Categories Helpful in Text Classification?
    Kruczek, Jakub
    Kruczek, Paulina
    Kuta, Marcin
    COMPUTATIONAL SCIENCE - ICCS 2020, PT II, 2020, 12138 : 524 - 537
  • [28] A Neural N-Gram Network for Text Classification
    Yan, Zhenguo
    Wu, Yue
    JOURNAL OF ADVANCED COMPUTATIONAL INTELLIGENCE AND INTELLIGENT INFORMATICS, 2018, 22 (03) : 380 - 386
  • [29] N-gram Based Croatian Language Network: Application in a Smart Environment
    Soic, Renato
    Vukovic, Marin
    JOURNAL OF COMMUNICATIONS SOFTWARE AND SYSTEMS, 2022, 18 (01) : 63 - 71
  • [30] Multiclass composite N-gram language model based on connection direction
    Yamamoto, Hirofumi
    Sagisaka, Yoshinori
    Systems and Computers in Japan, 2003, 34 (07) : 108 - 114