A multi-channel/multi-speaker interactive 3D Audio-Visual Speech Corpus in Mandarin

Cited by: 0
Authors
Yu, Jun [1 ,2 ,3 ]
Su, Rongfeng [1 ,2 ]
Wang, Lan [1 ,2 ]
Zhou, Wenpeng [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Shenzhen Inst Adv Technol, Key Lab Human Machine Intelligence Synergy Syst, Shenzhen, Peoples R China
[2] Chinese Univ Hong Kong, Hong Kong, Hong Kong, Peoples R China
[3] Lanzhou Univ, Sch Informat Sci & Engn, Lanzhou, Gansu, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
3D audio-visual; 3D facial motion; multi-channel; multi-speaker;
DOI
Not available
Chinese Library Classification (CLC)
TP301 [Theory, Methods];
Discipline code
081202;
Abstract
This paper presents a multi-channel/multi-speaker 3D audio-visual corpus for Mandarin continuous speech recognition and related fields such as speech visualization and speech synthesis. The corpus covers 24 speakers and about 18k utterances, totalling roughly 20 hours. For each utterance, the audio streams were recorded by two professional microphones, one near-field and one far-field, while a marker-based 3D facial motion capture system with six infrared cameras acquired the 3D video streams; the corresponding 2D video streams were captured by an additional camera as a supplement. A data-processing pipeline is described for synchronizing the audio and video streams, detecting and correcting outliers, and removing head motion from the recordings. Finally, the data-processing results are discussed. To date, this is the largest 3D audio-visual corpus for Mandarin.
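The abstract does not specify how head motion is removed; a standard approach for marker-based capture is rigid Procrustes/Kabsch alignment of a subset of skull-fixed markers to a reference pose, so that only articulatory (non-rigid) facial motion remains. The sketch below illustrates that idea only; the function name, the (T, N, 3) data layout, and the choice of rigid markers are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def remove_head_motion(frames, reference, rigid_idx):
    """Cancel rigid head motion in marker-based 3D facial capture.

    frames:    (T, N, 3) marker positions over T frames
    reference: (N, 3) neutral-pose marker positions
    rigid_idx: indices of markers assumed fixed to the skull
               (e.g. forehead/nose-bridge markers; hypothetical choice)
    """
    aligned = np.empty_like(frames)
    ref = reference[rigid_idx]
    ref_centered = ref - ref.mean(axis=0)
    for t, frame in enumerate(frames):
        src = frame[rigid_idx]
        src_centered = src - src.mean(axis=0)
        # Kabsch algorithm: best-fit rotation from the SVD of the
        # cross-covariance between the frame's and reference's rigid markers.
        H = src_centered.T @ ref_centered
        U, _, Vt = np.linalg.svd(H)
        d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
        trans = ref.mean(axis=0) - R @ src.mean(axis=0)
        # Apply the inverse head pose to every marker in the frame.
        aligned[t] = frame @ R.T + trans
    return aligned
```

The outlier step the abstract mentions would sit upstream of this alignment, for example by flagging markers whose frame-to-frame displacement exceeds a plausibility threshold and interpolating over the gaps.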
Pages: 5
Related papers
50 records in total
  • [21] Audio-Visual Clustering for 3D Speaker Localization
    Khalidov, Vasil
    Forbes, Florence
    Hansard, Miles
    Arnaud, Elise
    Horaud, Radu
    MACHINE LEARNING FOR MULTIMODAL INTERACTION, PROCEEDINGS, 2008, 5237 : 86 - 97
  • [22] Candidate Speech Extraction from Multi-speaker Single-Channel Audio Interviews
    Pandharipande, Meghna
    Kopparapu, Sunil Kumar
    SPEECH AND COMPUTER, SPECOM 2023, PT I, 2023, 14338 : 210 - 221
  • [24] Exploiting the Complementarity of Audio and Visual Data in Multi-Speaker Tracking
    Ban, Yutong
    Girin, Laurent
    Alameda-Pineda, Xavier
    Horaud, Radu
    2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2017), 2017, : 446 - 454
  • [25] The 'Audio-Visual Face Cover Corpus': Investigations into audio-visual speech and speaker recognition when the speaker's face is occluded by facewear
    Fecher, Natalie
    13TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2012 (INTERSPEECH 2012), VOLS 1-3, 2012, : 2247 - 2250
  • [26] DeepMine-multi-TTS: a Persian speech corpus for multi-speaker text-to-speech
    Adibian, Majid
    Zeinali, Hossein
    Barmaki, Soroush
    LANGUAGE RESOURCES AND EVALUATION, 2025,
  • [27] MULTI-SCALE HYBRID FUSION NETWORK FOR MANDARIN AUDIO-VISUAL SPEECH RECOGNITION
    Wang, Jinxin
    Guo, Zhongwen
    Yang, Chao
    Li, Xiaomei
    Cui, Ziyuan
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 642 - 647
  • [28] Multi-speaker DoA Estimation Using Audio and Visual Modality
    Wu, Yulin
    Hu, Ruimin
    Wang, Xiaochen
    Ke, Shanfa
    NEURAL PROCESSING LETTERS, 2023, 55 (07) : 8887 - 8901
  • [29] Multimodal Learning Using 3D Audio-Visual Data for Audio-Visual Speech Recognition
    Su, Rongfeng
    Wang, Lan
    Liu, Xunying
    2017 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2017, : 40 - 43
  • [30] CLeLfPC: a Large Open Multi-Speaker Corpus of French Cued Speech
    Bigi, Brigitte
    Zimmermann, Maryvonne
    Andre, Carine
    LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 987 - 994