A variant of n-gram based language classification

被引:0
|
作者
Tomovic, Andrija [1 ]
Janicic, Predrag [2 ]
机构
[1] Novartis Res Fdn, Friedrich Miescher Inst Biomed Res, Maulbeerstr 66, CH-4058 Basel, Switzerland
[2] Univ Belgrade, Fac Math, Belgrade, Serbia
关键词
D O I
暂无
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Rapid classification of documents is of high-importance in many multilingual settings (such as international institutions or Internet search engines). This has been, for years, a well-known problem, addressed by different techniques, with excellent results. We address this problem by a simple n-grams based technique, a variation of techniques of this family. Our n-grams-based classification is very robust and successful, even for 20-fold classification, and even for short text strings. We give a detailed study for different lengths of strings and size- of n-grarns and we explore what classification parameters give the best performance. There is no requirement for vocabularies, but only for a few training documents. As a main corpus, we used a EU set of documents in 20 lanuages. Experimental comparison shows that our approach gives better, 0 results than four other popular approaches.
引用
收藏
页码:410 / +
页数:3
相关论文
共 50 条
  • [41] A language independent n-gram model for word segmentation
    Kang, Seung-Shik
    Hwang, Kyu-Baek
    Lect. Notes Comput. Sci., 1600, (557-565):
  • [42] Improved N-gram Phonotactic Models For Language Recognition
    BenZeghiba, Mohamed Faouzi
    Gauvain, Jean-Luc
    Lamel, Lori
    11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010, : 2718 - 2721
  • [43] A language independent n-gram model for word segmentation
    Kang, Seung-Shik
    Hwang, Kyu-Baek
    AI 2006: ADVANCES IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2006, 4304 : 557 - +
  • [44] N-GRAM ANALYSIS OF TEXT DOCUMENTS IN SERBIAN LANGUAGE
    Marovac, Ulfeta
    Pljaskovic, Aldina
    Crnisanin, Adela
    Kajan, Ejub
    2012 20TH TELECOMMUNICATIONS FORUM (TELFOR), 2012, : 1385 - 1388
  • [45] Efficient MDI Adaptation for n-gram Language Models
    Huang, Ruizhe
    Li, Ke
    Arora, Ashish
    Povey, Daniel
    Khudanpur, Sanjeev
    INTERSPEECH 2020, 2020, : 4916 - 4920
  • [46] N-gram language models for massively parallel devices
    Bogoychev, Nikolay
    Lopez, Adam
    PROCEEDINGS OF THE 54TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1, 2016, : 1944 - 1953
  • [47] Constrained Discriminative Training of N-gram Language Models
    Rastrow, Ariya
    Sethy, Abhinav
    Ramabhadran, Bhuvana
    2009 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION & UNDERSTANDING (ASRU 2009), 2009, : 311 - +
  • [48] Multilingual stochastic n-gram class language models
    Jardino, M
    1996 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, CONFERENCE PROCEEDINGS, VOLS 1-6, 1996, : 161 - 163
  • [49] POWER LAW DISCOUNTING FOR N-GRAM LANGUAGE MODELS
    Huang, Songfang
    Renals, Steve
    2010 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2010, : 5178 - 5181
  • [50] N-gram language models for document image decoding
    Kopec, GE
    Said, MR
    Popat, K
    DOCUMENT RECOGNITION AND RETRIEVAL IX, 2002, 4670 : 191 - 202