Language Identification of Bengali-English Code-Mixed Data using Character & Phonetic based LSTM Models

Cited by: 4
Authors
Das, Sourya Dipta [1]
Mandal, Soumil [2]
Das, Dipankar [1]
Affiliations
[1] Jadavpur Univ, Kolkata, India
[2] SRM Univ, Chennai, Tamil Nadu, India
Source
PROCEEDINGS OF THE 11TH ANNUAL MEETING OF THE FORUM FOR INFORMATION RETRIEVAL EVALUATION (FIRE 2019) | 2019
Keywords
code-mixing; code-switching; phonetic encoding; character encoding; language identification
DOI
10.1145/3368567.3368578
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Language identification of social media text remains a challenging task due to properties such as code-mixing and inconsistent phonetic transliteration. In this paper, we present a supervised learning approach for word-level language identification of low-resource Bengali-English code-mixed data taken from social media. We employ two methods of word encoding, character-based and root-phone-based, to train our deep LSTM models. Using these two models, we created two ensemble models, one based on stacking and one on a threshold technique, which achieved accuracies of 91.78% and 92.35%, respectively, on our test data.
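The abstract does not spell out the encodings or the threshold rule, so the following is only a minimal sketch of what a character-based word encoding and a threshold-style combination of two word-level models could look like. All function names, the padding scheme, the `BN`/`EN` labels, and the 0.9 threshold are assumptions made for illustration and are not taken from the paper.

```python
from typing import Dict, List

PAD = 0  # index reserved for padding (assumption)
UNK = 1  # index reserved for characters missing from the vocabulary (assumption)


def build_char_vocab(words: List[str]) -> Dict[str, int]:
    """Map each distinct lowercase character to an integer index, starting at 2."""
    chars = sorted({ch for w in words for ch in w.lower()})
    return {ch: i + 2 for i, ch in enumerate(chars)}


def encode_word(word: str, vocab: Dict[str, int], max_len: int = 12) -> List[int]:
    """Turn a word into a fixed-length list of character indices (truncate, then pad)."""
    ids = [vocab.get(ch, UNK) for ch in word.lower()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


def threshold_ensemble(p_char: float, p_phone: float, threshold: float = 0.9) -> str:
    """Hypothetical threshold rule, not the paper's exact method.

    p_char and p_phone are each model's probability that the word is Bengali.
    The character model is trusted when it is confident in either direction;
    otherwise the phonetic model's probability decides.
    """
    if p_char >= threshold or p_char <= 1.0 - threshold:
        p = p_char
    else:
        p = p_phone
    return "BN" if p >= 0.5 else "EN"


if __name__ == "__main__":
    vocab = build_char_vocab(["bhalo", "good"])
    print(encode_word("bhalo", vocab, max_len=6))
    print(threshold_ensemble(p_char=0.6, p_phone=0.2))
```

In a real pipeline the index sequences would feed an embedding layer followed by an LSTM, and the two probabilities would come from the trained character and phonetic models. A stacking ensemble would instead train a second-level classifier on the two models' outputs.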
Pages: 60 - 64
Number of pages: 5
Related papers
40 items in total
  • [1] Deep Insights of Erroneous Bengali-English Code-Mixed Bilingual Language
    Ganguli, Isha
    Bhowmick, Rajat Subhra
    Sil, Jaya
    IETE JOURNAL OF RESEARCH, 2023, 69 (06) : 3334 - 3345
  • [2] Mixed language processing increases cross-language phonetic transfer in Bengali-English bilinguals
    Mitra, Auromita
    Dutta, Indranil
    BILINGUALISM-LANGUAGE AND COGNITION, 2023, 26 (05) : 896 - 909
  • [3] Abusive Comment Detection from Bengali-English Code-Mixed Social Media Texts Using Ensemble of Deep Learning
    Fahim, Iftekhar
    Ahsan, Shawly
    Hoque, Mohammed Moshiul
    ARTIFICIAL INTELLIGENCE AND KNOWLEDGE PROCESSING, AIKP 2024, 2025, 2228 : 252 - 267
  • [4] Deep Learning-Based Language Identification in English-Hindi-Bengali Code-Mixed Social Media Corpora
    Jamatia, Anupam
    Das, Amitava
    Gambaeck, Bjoern
    JOURNAL OF INTELLIGENT SYSTEMS, 2019, 28 (03) : 399 - 408
  • [5] Character Embedding for Language Identification in Hindi-English Code-mixed Social Media Text
    Veena, P. V.
    Kumar, M. Anand
    Soman, K. P.
    COMPUTACION Y SISTEMAS, 2018, 22 (01) : 65 - 74
  • [6] Word Level Language Identification in Assamese-Bengali-Hindi-English Code-Mixed Social Media Text
    Sarma, Neelakshi
    Singh, Sanasam Ranbir
    Goswami, Diganta
    2018 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2018 : 261 - 266
  • [7] Transformer Based Language Identification for Malayalam-English Code-Mixed Text
    Thara, S.
    Poornachandran, Prabaharan
    IEEE Access, 2021, 9 : 118837 - 118850
  • [8] Language Detection in Sinhala-English Code-mixed Data
    Smith, Ian
    Thayasivam, Uthayasanker
    PROCEEDINGS OF THE 2019 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2019 : 228 - 233
  • [9] Sentiment Analysis and Offensive Language Identification in Code-Mixed Tamil-English Languages Using Transformer-Based Models
    Ponnambalam, Satheesh Kumar
    Desai, Darshana
    ADVANCED NETWORK TECHNOLOGIES AND INTELLIGENT COMPUTING, ANTIC 2023, PT III, 2024, 2092 : 149 - 167