Minority Positive Sampling for Switching Points - an Anecdote for the Code-Mixing Language Modeling

Cited by: 0

Authors:

Chatterjee, Arindam [1]

Guptha, Bodla Vineeth [1]

Chopra, Parul [1]

Das, Amitava [1]

Affiliations:

[1] Wipro AI Labs, Bangalore, Karnataka, India

Source:

PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020) | 2020

Keywords:

code-mix; language modelling; Hinglish; switching point; LAW

DOI:

Not available

Chinese Library Classification (CLC) number:

TP39 [Computer applications]

Discipline classification code:

081203; 0835

Abstract:

Code-Mixing (CM) or language mixing is a social norm in multilingual societies. CM is quite prevalent in social media conversations in multilingual regions such as India, Europe, Canada and Mexico. In this paper, we explore the problem of Language Modeling (LM) for code-mixed Hinglish (Hindi-English language pair) text. In recent times, there have been several success stories with neural language modeling, such as Generative Pre-trained Transformer (GPT) (Radford et al., 2019) and Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018). Hence, neural language models have become the new holy grail of modern NLP, although LM for CM is an unexplored area altogether. To better understand the problem of LM for CM, we initially experimented with several statistical language modeling techniques and subsequently experimented with contemporary neural language models. Analysis shows that switching points (junctions in the text where the language switches) are the main challenge for the CM language model and the reason for the performance drop, as compared to monolingual LMs. To handle this impediment, in this paper we introduce the concept of minority positive sampling, to selectively induce more samples, to achieve better performance. The neural language models for CM demand a huge corpus; still, they exhibit improvement in performance after the samples are induced. Finally, we report a perplexity of 139 for Hinglish LM for CM using a statistical bi-directional technique.
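The two quantities the abstract relies on, switching points and perplexity, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the language tags ("hi", "en"), the example sentence, and the token probabilities are all hypothetical, and the sketch assumes that each token already carries a language tag and that a language model has already assigned a probability to each token.

```python
import math
from typing import List


def switching_points(lang_tags: List[str]) -> List[int]:
    """Return every index i where token i is in a different language
    than token i-1 (hypothetical helper, not from the paper)."""
    return [
        i for i in range(1, len(lang_tags))
        if lang_tags[i] != lang_tags[i - 1]
    ]


def perplexity(token_probs: List[float]) -> float:
    """Standard perplexity: exp of the average negative log-probability
    that a language model assigns to each token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)


# Hypothetical Hinglish example with made-up language tags.
tokens = ["yeh", "movie", "bahut", "amazing", "thi"]
tags = ["hi", "en", "hi", "en", "hi"]
points = switching_points(tags)

# Made-up per-token probabilities, for illustration only.
probs = [0.25, 0.25]
ppl = perplexity(probs)
```

In this toy sentence every token after the first is a switching point, which is the kind of position the abstract identifies as hardest for a code-mixed language model. Lower perplexity means the model finds the text less surprising.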


Pages: 6228-6236

Page count: 9
