Automatic language identification: a case study of Pahari languages

Authors
Gusain, Rachana [1]
Dash, Satya Ranjan [2]
Parida, Shantipriya [3]
Jha, Girish Nath [4]
Affiliations
[1] Doon Univ, Dehra Dun, Uttarakhand, India
[2] KIIT Univ, Bhubaneswar, Odisha, India
[3] Silo AI, Helsinki, Finland
[4] Jawaharlal Nehru Univ, New Delhi, India
Keywords
Low-resource languages; Corpus development; Statistical analysis; Language identification; Northern Indo-Aryan; Pahari; Nepali; Garhwali; Kumaoni; Dogri
DOI
10.1007/s10579-023-09651-6
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
In an attempt to expand the inclusiveness of Natural Language Processing, this paper focuses on developing resources and building machine learning models to identify four languages of the Northern Indo-Aryan family, also known as Pahari languages: Nepali, Garhwali, Kumaoni, and Dogri. This is the first attempt at building identification models for Pahari languages and at developing a plain-text corpus for Garhwali and Kumaoni, both of which are lesser-known and under-resourced languages/mother tongues of India. The collected corpus, including data in Nepali and Dogri, is statistically analyzed at the word level. We also trained traditional machine learning models for Pahari language identification on this corpus and found that linear Support Vector Machines using character n-gram features performed best, with 99.28% accuracy.
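As a rough illustration of the kind of model the abstract describes, the sketch below trains a one-vs-rest linear SVM over character n-gram counts using only the Python standard library. It is not the authors' implementation: the n-gram range (1 to 3), the subgradient-descent training loop, the hyperparameters, and the toy strings labelled `lang_a` and `lang_b` are all assumptions made for illustration, and none of the data comes from the paper's corpus.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Count all character n-grams of length n_min..n_max (assumed range)."""
    padded = f" {text.strip()} "  # pad so word boundaries become features
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


def normalize(features):
    """Scale a sparse feature dict to unit L2 norm."""
    norm = sum(v * v for v in features.values()) ** 0.5
    return {k: v / norm for k, v in features.items()} if norm else dict(features)


def score(weights, features):
    """Dot product between a sparse weight dict and a sparse feature dict."""
    return sum(weights.get(k, 0.0) * v for k, v in features.items())


def train_linear_svm(samples, labels, epochs=50, lr=0.1, reg=0.01):
    """One-vs-rest linear SVM: subgradient descent on L2-regularized hinge loss."""
    feats = [normalize(char_ngrams(s)) for s in samples]
    models = {}
    for target in sorted(set(labels)):
        w = {}
        for _ in range(epochs):
            for x, label in zip(feats, labels):
                y = 1.0 if label == target else -1.0
                margin = y * score(w, x)
                for k in w:  # shrink weights (gradient of the L2 penalty)
                    w[k] *= 1.0 - lr * reg
                if margin < 1.0:  # hinge loss is active: move towards y * x
                    for k, v in x.items():
                        w[k] = w.get(k, 0.0) + lr * y * v
        models[target] = w
    return models


def predict(models, text):
    """Return the label whose one-vs-rest model gives the highest score."""
    x = normalize(char_ngrams(text))
    return max(models, key=lambda label: score(models[label], x))


# Invented toy strings standing in for two languages (not real corpus data).
TEXTS = [
    "zaka zuka zaki", "zuki zaka zoka", "zoki zaka zuka",
    "thelo thela theli", "thila thelo thulo", "thulo theli thela",
]
LABELS = ["lang_a"] * 3 + ["lang_b"] * 3

models = train_linear_svm(TEXTS, LABELS)
```

Calling `predict(models, "zaka zoki")` then returns the label of the toy language whose character patterns the input shares. Because the features are plain Unicode substrings, the same functions would accept Devanagari text unchanged; a realistic setup would need a proper corpus, a held-out test set, and most likely an established SVM library.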
Pages: 1361-1387
Number of pages: 27
Related papers
50 in total
  • [1] Automatic language identification: a case study of Pahari languages
    Rachana Gusain
    Satya Ranjan Dash
    Shantipriya Parida
    Girish Nath Jha
    Language Resources and Evaluation, 2023, 57 : 1361 - 1387
  • [2] Automatic Language Identification of Three Indian Languages Using Vector Quantization
    Roy, Pinki
    Das, Pradip K.
    FOURTH INTERNATIONAL CONFERENCE ON COMPUTER AND ELECTRICAL ENGINEERING (ICCEE 2011), 2011, : 293 - +
  • [3] Automatic Language Identification for Berber and Arabic Languages using Prosodic Features
    Lounnas, Khlaed
    Demri, Lyes
    Teffahi, Hocine
    Falek, Leila
    PROCEEDINGS 2018 3RD INTERNATIONAL CONFERENCE ON ELECTRICAL SCIENCES AND TECHNOLOGIES IN MAGHREB (CISTEM), 2018, : 239 - 242
  • [4] Automatic Language Identification for Romance Languages using Stop Words and Diacritics
    Truica, Ciprian-Octavian
    Velcin, Julien
    Boicea, Alexandru
    2015 17TH INTERNATIONAL SYMPOSIUM ON SYMBOLIC AND NUMERIC ALGORITHMS FOR SCIENTIFIC COMPUTING (SYNASC), 2016, : 243 - 246
  • [5] Automatic identification of European languages
    Zhdanova, AV
    NATURAL LANGUAGE PROCESSING AND INFORMATION SYSTEMS, 2002, 2553 : 76 - 84
  • [6] A GMM-Based Hierarchical Automatic Language Identification System for Indian Languages
    Jothilakshmi, S.
    Ramalingam, V.
    Palanivel, S.
    APPLIED ARTIFICIAL INTELLIGENCE, 2012, 26 (06) : 554 - 570
  • [7] Automatic Language Identification for Seven Indian Languages using Higher Level Features
    Madhu, Chithra
    George, Anu
    Mary, Leena
    2017 IEEE INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING, INFORMATICS, COMMUNICATION AND ENERGY SYSTEMS (SPICES), 2017,
  • [8] Language Identification for Austronesian Languages
    Dunn, Jonathan
    Nijhof, Wikke
    LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 6530 - 6539
  • [9] Experiments on Automatic Language Identification for Philippine Languages using Acoustic Gaussian Mixture Models
    Laguna, Ann Franchesca
    Guevara, Rowena Cristina
    2014 IEEE REGION 10 SYMPOSIUM, 2014, : 657 - 662
  • [10] Automatic language identification
    Zissman, MA
    Berkling, KM
    SPEECH COMMUNICATION, 2001, 35 (1-2) : 115 - 124