A variant of n-gram based language classification

Authors
Tomovic, Andrija [1]
Janicic, Predrag [2]
Affiliations
[1] Novartis Res Fdn, Friedrich Miescher Inst Biomed Res, Maulbeerstr 66, CH-4058 Basel, Switzerland
[2] Univ Belgrade, Fac Math, Belgrade, Serbia
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Rapid classification of documents is of high importance in many multilingual settings (such as international institutions or Internet search engines). This has been a well-known problem for years, addressed by different techniques with excellent results. We address this problem with a simple n-gram-based technique, a variation of the techniques in this family. Our n-gram-based classification is very robust and successful, even for 20-fold classification and even for short text strings. We give a detailed study for different string lengths and n-gram sizes, and we explore which classification parameters give the best performance. No vocabularies are required, only a few training documents. As the main corpus, we used an EU set of documents in 20 languages. Experimental comparison shows that our approach gives better results than four other popular approaches.
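
The abstract does not spell out the authors' variant, so the following is only a minimal sketch of the general family it refers to: character n-gram profiles compared by a rank-order ("out-of-place") distance, in the style commonly attributed to Cavnar and Trenkle. The function names, the toy training sentences, and the parameters `n=3` and `top_k=300` are illustrative assumptions, not values taken from the paper.

```python
from collections import Counter


def ngram_profile(text, n=3, top_k=300):
    """Map the top_k most frequent character n-grams of text to their rank (0 = most frequent)."""
    # Normalize case and whitespace, and pad so word boundaries form n-grams too.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    # Sort by descending count, then alphabetically, so ranks are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile, max_penalty):
    """Sum of rank differences; n-grams missing from the language profile cost max_penalty."""
    total = 0
    for gram, rank in doc_profile.items():
        if gram in lang_profile:
            total += abs(rank - lang_profile[gram])
        else:
            total += max_penalty
    return total


def classify(text, lang_profiles, n=3, top_k=300):
    """Return the language whose profile is closest to the profile of text."""
    doc = ngram_profile(text, n, top_k)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang], top_k))


# Toy training data (hypothetical; a real setup would use a few documents per language).
training = {
    "english": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "german": "der schnelle braune fuchs springt über den faulen hund und dann schläft der hund in der sonne",
}
profiles = {lang: ngram_profile(text) for lang, text in training.items()}

print(classify("the dog and the fox", profiles))
print(classify("der hund und der fuchs", profiles))
```

As in the abstract, this kind of classifier needs no vocabulary, only a small amount of training text per language; the n-gram size and the profile length are the parameters one would vary in a study like the one described.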
Pages: 410 / +
Page count: 3