Mandarin–English code-switching speech corpus in South-East Asia: SEAME

Cited by: 0
Authors
Dau-Cheng Lyu
Tien-Ping Tan
Eng-Siong Chng
Haizhou Li
Affiliations
[1] Nanyang Technological University, Temasek Laboratories
[2] Nanyang Technological University, School of Computer Engineering
[3] Institute for Infocomm Research, School of Computer Sciences
[4] Universiti Sains Malaysia
[5] The University of New South Wales
Source
Language Resources and Evaluation, 2015, 49 (3)
Keywords
Code-switching speech; Spontaneous spoken corpus development; Mandarin–English; Speech recognition; Language recognition
DOI
Not available
Abstract
This paper introduces the South East Asia Mandarin–English (SEAME) corpus, a 63-hour transcribed corpus of spontaneous Mandarin–English code-switching speech suitable for large-vocabulary continuous speech recognition (LVCSR) and language change detection/identification research. The corpus was recorded in unscripted interview and conversational settings from 157 Singaporean and Malaysian speakers who mixed Mandarin and English within a single sentence. About 82% of the transcribed utterances are intra-sentential code-switching speech, and the corpus will be released by LDC in 2015. This paper presents an analysis of the code-switching statistics of the corpus, such as the duration of monolingual segments and the frequency of language turns in code-switching utterances. We also summarize the development effort, including the processing time for transcription, validation, and language boundary labelling. Lastly, we present textual analyses of code-switching segments, examining the word length of monolingual segments in code-switching utterances and the most common single words and two-word phrases in such segments.
Pages: 581-600
Page count: 19
Related papers
  • [1] Mandarin-English code-switching speech corpus in South-East Asia: SEAME
    Lyu, Dau-Cheng
    Tan, Tien-Ping
    Chng, Eng-Siong
    Li, Haizhou
    LANGUAGE RESOURCES AND EVALUATION, 2015, 49 (03) : 581 - 600
  • [2] SEAME: a Mandarin-English Code-switching Speech Corpus in South-East Asia
    Lyu, Dau-Cheng
    Tan, Tien-Ping
    Chng, Eng-Siong
    Li, Haizhou
    11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010, : 1986 - +
  • [3] A Review of the Mandarin-English Code-switching Corpus: SEAME
    Lee, Grandee
    Ho, Thi-Nga
    Chng, Eng-Siong
    Li, Haizhou
    2017 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2017, : 210 - 213
  • [4] A Mandarin-English Code-Switching Corpus
    Li, Ying
    Yu, Yue
    Fung, Pascale
    LREC 2012 - EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2012, : 2515 - 2519
  • [5] Mandarin-English Code-switching Speech Recognition
    Xu, Haihua
    Van Tung Pham
    Kyaw, Zin Tun
    Lim, Zhi Hao
    Chng, Eng Siong
    Li, Haizhou
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 554 - 555
  • [6] TALCS: AN OPEN-SOURCE MANDARIN-ENGLISH CODE-SWITCHING CORPUS AND A SPEECH RECOGNITION BASELINE
    Li, Chengfei
    Deng, Shuhao
    Wang, Yaoping
    Wang, Guangjing
    Gong, Yaguang
    Chen, Changbin
    Bai, Jinfeng
    INTERSPEECH 2022, 2022, : 1741 - 1745
  • [7] Pronunciation augmentation for Mandarin-English code-switching speech recognition
    Long, Yanhua
    Wei, Shuang
    Lian, Jie
    Li, Yijie
    EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2021, 2021 (01)
  • [8] Pronunciation augmentation for Mandarin-English code-switching speech recognition
    Yanhua Long
    Shuang Wei
    Jie Lian
    Yijie Li
    EURASIP Journal on Audio, Speech, and Music Processing, 2021
  • [9] An Empirical Study on Punctuation Restoration for English, Mandarin, and Code-Switching Speech
    Liu, Changsong
    Thi Nga Ho
    Chng, Eng Siong
    INTELLIGENT INFORMATION AND DATABASE SYSTEMS, ACIIDS 2023, PT II, 2023, 13996 : 286 - 296
  • [10] NON-AUTOREGRESSIVE MANDARIN-ENGLISH CODE-SWITCHING SPEECH RECOGNITION
    Chuang, Shun-Po
    Chang, Heng-Jui
    Huang, Sung-Feng
    Lee, Hung-yi
    2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 465 - 472