Automatic language identification: a case study of Pahari languages

Cited by: 0

Authors:

Gusain, Rachana [1]

Dash, Satya Ranjan [2]

Parida, Shantipriya [3]

Jha, Girish Nath [4]

Affiliations:

[1] Doon Univ, Dehra Dun, Uttarakhand, India

[2] KIIT Univ, Bhubaneswar, Odisha, India

[3] Silo AI, Helsinki, Finland

[4] Jawaharlal Nehru Univ, New Delhi, India

Source:

Language Resources and Evaluation | 2023 / Vol. 57 / Issue 03

Keywords:

Low-resource languages; Corpus development; Statistical analysis; Language identification; Northern Indo-Aryan; Pahari; Nepali; Garhwali; Kumaoni; Dogri

DOI:

10.1007/s10579-023-09651-6

Chinese Library Classification (CLC) number:

TP39 [Computer applications]

Subject classification codes:

081203; 0835

Abstract:

In an attempt to expand the inclusiveness of Natural Language Processing, this paper focuses on developing resources and building machine learning models to identify four languages of the Northern Indo-Aryan family, also known as Pahari languages: Nepali, Garhwali, Kumaoni, and Dogri. This is the first attempt at building identification models for Pahari languages and at developing a plain-text corpus for Garhwali and Kumaoni, both of which are lesser-known and under-resourced languages/mother tongues of India. The collected corpus, which also includes data in Nepali and Dogri, is statistically analyzed at the word level. We also trained traditional machine learning models for Pahari language identification on this corpus and found that Linear Support Vector Machines based on character n-grams performed best, with 99.28% accuracy.
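To illustrate what character n-gram features for language identification look like, here is a minimal sketch using only the Python standard library. It is not the authors' implementation: the abstract reports a Linear Support Vector Machine, whereas this sketch swaps in a simpler nearest-profile classifier with cosine similarity so that it stays dependency-free. The function names (`char_ngrams`, `train`, `identify`), the n-gram range of 2 to 3, and the English/German toy sentences are all assumptions made for illustration; the paper's Pahari corpus is not reproduced here.

```python
from collections import Counter
import math


def char_ngrams(text, n_min=2, n_max=3):
    """Count character n-grams of length n_min..n_max (range is an assumption)."""
    text = f" {text.lower()} "  # pad so word boundaries become features
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(samples):
    """Build one aggregated n-gram profile per language label."""
    profiles = {}
    for text, label in samples:
        profiles.setdefault(label, Counter()).update(char_ngrams(text))
    return profiles


def identify(profiles, text):
    """Return the label whose profile is most similar to the text."""
    query = char_ngrams(text)
    return max(profiles, key=lambda label: cosine(query, profiles[label]))


# Toy stand-in data (hypothetical; NOT from the paper's corpus).
TOY_SAMPLES = [
    ("the cat sat on the mat", "en"),
    ("this is the house that we built", "en"),
    ("der hund läuft durch den garten", "de"),
    ("das ist ein schönes haus", "de"),
]

profiles = train(TOY_SAMPLES)
print(identify(profiles, "the dog is in the house"))
print(identify(profiles, "der garten ist schön"))
```

Character n-grams need no tokenizer or lexicon, which is one reason they are a common choice for closely related, low-resource languages. A setup closer to the one named in the abstract would feed the same kind of features to a linear SVM instead of comparing profiles directly.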


Pages: 1361-1387

Page count: 27
