A variant of n-gram based language classification

Authors
Tomovic, Andrija [1]
Janicic, Predrag [2]
Affiliations
[1] Novartis Res Fdn, Friedrich Miescher Inst Biomed Res, Maulbeerstr 66, CH-4058 Basel, Switzerland
[2] Univ Belgrade, Fac Math, Belgrade, Serbia
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Rapid classification of documents is of high importance in many multilingual settings (such as international institutions or Internet search engines). This has been a well-known problem for years, addressed by different techniques with excellent results. We address this problem with a simple n-gram-based technique, a variation of techniques of this family. Our n-gram-based classification is very robust and successful, even for 20-fold classification and even for short text strings. We give a detailed study for different string lengths and n-gram sizes, and we explore which classification parameters give the best performance. There is no requirement for vocabularies, only for a few training documents. As the main corpus, we used an EU set of documents in 20 languages. Experimental comparison shows that our approach gives better results than four other popular approaches.
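
To illustrate the general idea of n-gram-based language classification, here is a minimal sketch. It is not the authors' variant, which the abstract does not specify. It assumes character trigrams, relative-frequency profiles, and cosine similarity as the comparison measure. The function names, the tiny training sentences, and the choice of n = 3 are all hypothetical and chosen only for illustration.

```python
from collections import Counter
import math


def ngram_profile(text, n=3):
    """Return relative frequencies of character n-grams in `text`."""
    # Normalize case and whitespace; pad so word boundaries form n-grams too.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(v * q.get(gram, 0.0) for gram, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def classify(text, profiles, n=3):
    """Pick the language whose profile is most similar to the text."""
    query = ngram_profile(text, n)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


# Toy training data (an assumption; a real system needs far more text,
# although no vocabulary is required, only training documents).
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog "
               "and the cat is on the table with the book",
    "german": "der schnelle braune fuchs springt über den faulen hund "
              "und die katze ist auf dem tisch mit dem buch",
    "french": "le renard brun rapide saute par-dessus le chien paresseux "
              "et le chat est sur la table avec le livre",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in TRAINING.items()}

if __name__ == "__main__":
    for sample in ["the dog is on the table", "die katze und der hund"]:
        print(sample, "->", classify(sample, PROFILES))
```

Varying `n` and the length of the input string in a sketch like this mirrors, in a very small way, the kind of parameter study the abstract describes.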
Pages: 410 / +
Number of pages: 3