Minority Positive Sampling for Switching Points - an Anecdote for the Code-Mixing Language Modeling

Cited by: 0
Authors
Chatterjee, Arindam [1]
Guptha, Bodla Vineeth [1]
Chopra, Parul [1]
Das, Amitava [1]
Affiliations
[1] Wipro AI Labs, Bangalore, Karnataka, India
Keywords
code-mix; language modelling; Hinglish; switching point; LAW;
DOI
Not available
Chinese Library Classification (CLC) number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
Code-Mixing (CM), or language mixing, is a social norm in multilingual societies. CM is quite prevalent in social media conversations in multilingual regions such as India, Europe, Canada and Mexico. In this paper, we explore the problem of Language Modeling (LM) for code-mixed Hinglish (Hindi-English language pair) text. In recent times, there have been several success stories with neural language modeling, such as the Generative Pre-trained Transformer (GPT) (Radford et al., 2019) and Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018). Hence, neural language models have become the new holy grail of modern NLP, although LM for CM remains a largely unexplored area. To better understand the problem of LM for CM, we first experimented with several statistical language modeling techniques and subsequently with contemporary neural language models. Our analysis shows that switching points (junctions in the text where the language switches) are the main challenge for a CM language model and the reason for its drop in performance compared to monolingual LMs. To handle this impediment, we introduce the concept of minority positive sampling, which selectively induces more samples to achieve better performance. Neural language models for CM demand a huge corpus, yet they still show improved performance once the samples are induced. Finally, we report a perplexity of 139 for the Hinglish CM language model using a statistical bidirectional technique.
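To make the notion of a switching point concrete, the following Python sketch locates the junctions where the language changes in a token sequence and then repeats the bigrams that straddle them. It is an illustration only, not the procedure used in the paper: the abstract does not specify how samples are induced, so the per-token language tags, the function names `switching_points` and `oversample_switching_bigrams`, the `factor` parameter, and the example sentence are all assumptions made here.

```python
from typing import List, Tuple


def switching_points(lang_tags: List[str]) -> List[int]:
    """Return every index i where token i is in a different language than token i-1."""
    return [i for i in range(1, len(lang_tags)) if lang_tags[i] != lang_tags[i - 1]]


def oversample_switching_bigrams(
    tokens: List[str], lang_tags: List[str], factor: int = 2
) -> List[Tuple[str, str]]:
    """Collect bigrams, repeating those that straddle a switching point `factor` times.

    Assumption: duplication stands in for "inducing more samples"; the paper's
    actual minority positive sampling scheme may differ.
    """
    bigrams: List[Tuple[str, str]] = []
    for i in range(1, len(tokens)):
        pair = (tokens[i - 1], tokens[i])
        copies = factor if lang_tags[i] != lang_tags[i - 1] else 1
        bigrams.extend([pair] * copies)
    return bigrams


# Hypothetical Hinglish sentence with hand-assigned language tags.
tokens = ["mujhe", "yeh", "movie", "bahut", "pasand", "hai"]
tags = ["hi", "hi", "en", "hi", "hi", "hi"]

print(switching_points(tags))
print(len(oversample_switching_bigrams(tokens, tags, factor=2)))
```

In this example the language changes on entering and leaving the single English token, so two of the five bigrams cross a switching point and are the ones that get repeated.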
Pages: 6228 - 6236
Page count: 9