Mandarin–English code-switching speech corpus in South-East Asia: SEAME

Cited by: 0
Authors
Dau-Cheng Lyu
Tien-Ping Tan
Eng-Siong Chng
Haizhou Li
Affiliations
[1] Nanyang Technological University, Temasek Laboratories
[2] Nanyang Technological University, School of Computer Engineering
[3] Institute for Infocomm Research, School of Computer Sciences
[4] Universiti Sains Malaysia
[5] The University of New South Wales
Source
Keywords
Code-switching speech; Spontaneous spoken corpus development; Mandarin–English; Speech recognition; Language recognition
DOI
Not available
Chinese Library Classification number
Subject classification number
Abstract
This paper introduces the South East Asia Mandarin–English (SEAME) corpus, a 63-hour transcribed corpus of spontaneous Mandarin–English code-switching speech suitable for LVCSR and language change detection/identification research. The corpus was recorded in unscripted interview and conversational settings from 157 Singaporean and Malaysian speakers who spoke a mixture of Mandarin and English within a single sentence. About 82% of the transcribed utterances are intra-sentential code-switching speech, and the corpus will be released by LDC in 2015. This paper presents an analysis of the code-switching statistics of the corpus, such as the duration of monolingual segments and the frequency of language turns in code-switch utterances. We also summarize the development effort, including details such as the processing time for transcription, validation and language boundary labelling. Lastly, we present textual analyses of code-switch segments, examining the word length of monolingual segments in code-switch utterances and the most common single words and two-word phrases in such segments.
Pages: 581-600
Page count: 19
Related papers
50 items in total
  • [41] Code-Switching and College English Teaching
    Bo, Li
    PROCEEDINGS OF THE SIXTH NORTHEAST ASIA INTERNATIONAL SYMPOSIUM ON LANGUAGE, LITERATURE AND TRANSLATION, 2017, : 724 - 729
  • [42] Code-switching in medieval English drama
    Diller, HJ
    COMPARATIVE DRAMA, 1997, 31 (04) : 506 - 537
  • [43] CODE-SWITCHING - HINDI-ENGLISH
    VERMA, SK
    LINGUA, 1976, 38 (02) : 153 - 165
  • [44] A Turkish-German Code-Switching Corpus
    Cetinoglu, Ozlem
    LREC 2016 - TENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2016, : 4215 - 4220
  • [45] Hinglish: code-switching in Indian English
    Sailaja, Pingali
    ELT JOURNAL, 2011, 65 (04) : 473 - 480
  • [46] Code-switching in early English literature
    Schendl, Herbert
    LANGUAGE AND LITERATURE, 2015, 24 (03) : 233 - 248
  • [47] AN EVALUATION BENCHMARK FOR AUTOMATIC SPEECH RECOGNITION OF GERMAN-ENGLISH CODE-SWITCHING
    Khosravani, Abbas
    Garner, Philip N.
    Lazaridis, Alexandros
    2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 811 - 816
  • [48] Longitudinal Speaker Clustering and Verification Corpus with Code-Switching Frisian-Dutch Speech
    Yilmaz, Emre
    Dijkstra, Jelske
    Van de Velde, Hans
    Kampstra, Frederik
    Algra, Jouke
    van den Heuvel, Henk
    Van Leeuwen, David
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 37 - 41
  • [49] Borrowing or Code-switching? Traces of community norms in Vietnamese-English speech
    Li Nguyen
    AUSTRALIAN JOURNAL OF LINGUISTICS, 2018, 38 (04) : 443 - 466
  • [50] IITG-HingCoS corpus: A Hinglish code-switching database for automatic speech recognition
    Ganji, Sreeram
    Dhawan, Kunal
    Sinha, Rohit
    SPEECH COMMUNICATION, 2019, 110 : 76 - 89