Language Identification of Bengali-English Code-Mixed Data using Character & Phonetic based LSTM Models

Cited by: 4

Authors:

Das, Sourya Dipta [1]

Mandal, Soumil [2]

Das, Dipankar [1]

Affiliations:

[1] Jadavpur Univ, Kolkata, India

[2] SRM Univ, Chennai, Tamil Nadu, India

Source:

PROCEEDINGS OF THE 11TH ANNUAL MEETING OF THE FORUM FOR INFORMATION RETRIEVAL EVALUATION (FIRE 2019) | 2019

Keywords:

code-mixing; code-switching; phonetic encoding; character encoding; language identification;

DOI:

10.1145/3368567.3368578

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Language identification of social media text remains a challenging task due to properties such as code-mixing and inconsistent phonetic transliteration. In this paper, we present a supervised learning approach for word-level language identification of low-resource Bengali-English code-mixed data taken from social media. We employ two methods of word encoding, namely character-based and root-phone-based, to train our deep LSTM models. Using these two models, we created two ensemble models, one using stacking and one using a threshold technique, which gave accuracies of 91.78% and 92.35%, respectively, on our testing data.
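
As a rough illustration only: the abstract does not spell out the character encoding or the threshold rule, so the sketch below shows one plausible reading rather than the authors' method. The function names (char_encode, threshold_ensemble), the integer alphabet, the 0.8 threshold, and the "bn"/"en" labels are all assumptions made for this example. The two probabilities stand in for the outputs of the character-based and phonetic-based LSTM models.

```python
def char_encode(word: str, max_len: int = 12) -> list[int]:
    """Hypothetical character encoding: a-z -> 1..26, anything else -> 27, 0 = padding."""
    ids = [ord(c) - ord("a") + 1 if "a" <= c <= "z" else 27 for c in word.lower()]
    ids = ids[:max_len]                      # truncate long words
    return ids + [0] * (max_len - len(ids))  # right-pad short words


def threshold_ensemble(p_char: float, p_phone: float, threshold: float = 0.8) -> str:
    """Assumed threshold rule for combining two word-level classifiers.

    p_char and p_phone are each model's estimated probability that the word
    is Bengali. The character model is trusted when it is confident enough;
    otherwise the phonetic model decides.
    """
    char_confidence = max(p_char, 1.0 - p_char)
    p = p_char if char_confidence >= threshold else p_phone
    return "bn" if p >= 0.5 else "en"


if __name__ == "__main__":
    print(char_encode("Ami", max_len=5))
    print(threshold_ensemble(0.9, 0.1))  # confident character model decides
    print(threshold_ensemble(0.6, 0.2))  # falls back to the phonetic model
```

A stacking ensemble would instead feed both probabilities into a second, learned classifier; the fixed rule above is the simpler of the two ideas to show in a few lines.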

Citation

Pages: 60-64

Number of pages: 5

40 items in total

[1] Deep Insights of Erroneous Bengali-English Code-Mixed Bilingual Language
Ganguli, Isha
Bhowmick, Rajat Subhra
Sil, Jaya
IETE JOURNAL OF RESEARCH, 2023, 69 (06) : 3334 - 3345
[2] Mixed language processing increases cross-language phonetic transfer in Bengali-English bilinguals
Mitra, Auromita
Dutta, Indranil
BILINGUALISM-LANGUAGE AND COGNITION, 2023, 26 (05) : 896 - 909
[3] Abusive Comment Detection from Bengali-English Code-Mixed Social Media Texts Using Ensemble of Deep Learning
Fahim, Iftekhar
Ahsan, Shawly
Hoque, Mohammed Moshiul
ARTIFICIAL INTELLIGENCE AND KNOWLEDGE PROCESSING, AIKP 2024, 2025, 2228 : 252 - 267
[4] Deep Learning-Based Language Identification in English-Hindi-Bengali Code-Mixed Social Media Corpora
Jamatia, Anupam
Das, Amitava
Gambaeck, Bjoern
JOURNAL OF INTELLIGENT SYSTEMS, 2019, 28 (03) : 399 - 408
[5] Character Embedding for Language Identification in Hindi-English Code-mixed Social Media Text
Veena, P. V.
Kumar, M. Anand
Soman, K. P.
COMPUTACION Y SISTEMAS, 2018, 22 (01) : 65 - 74
[6] Word Level Language Identification in Assamese-Bengali-Hindi-English Code-Mixed Social Media Text
Sarma, Neelakshi
Singh, Sanasam Ranbir
Goswami, Diganta
2018 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2018 : 261 - 266
[7] Transformer Based Language Identification for Malayalam-English Code-Mixed Text
Thara, S.
Poornachandran, Prabaharan
IEEE Access, 2021, 9 : 118837 - 118850
[8] Language Detection in Sinhala-English Code-mixed Data
Smith, Ian
Thayasivam, Uthayasanker
PROCEEDINGS OF THE 2019 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2019 : 228 - 233
[9] Sentiment Analysis and Offensive Language Identification in Code-Mixed Tamil-English Languages Using Transformer-Based Models
Ponnambalam, Satheesh Kumar
Desai, Darshana
ADVANCED NETWORK TECHNOLOGIES AND INTELLIGENT COMPUTING, ANTIC 2023, PT III, 2024, 2092 : 149 - 167
