A multi-channel/multi-speaker interactive 3D Audio-Visual Speech Corpus in Mandarin

Cited by: 0
Authors
Yu, Jun [1 ,2 ,3 ]
Su, Rongfeng [1 ,2 ]
Wang, Lan [1 ,2 ]
Zhou, Wenpeng [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Shenzhen Inst Adv Technol, Key Lab Human Machine Intelligence Synergy Syst, Shenzhen, Peoples R China
[2] Chinese Univ Hong Kong, Hong Kong, Hong Kong, Peoples R China
[3] Lanzhou Univ, Sch Informat Sci & Engn, Lanzhou, Gansu, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
3D audio-visual; 3D facial motion; multi-channel; multi-speaker;
DOI
Not available
Chinese Library Classification (CLC)
TP301 [Theory, Methods];
Discipline code
081202;
Abstract
This paper presents a multi-channel/multi-speaker 3D audio-visual corpus for Mandarin continuous speech recognition and related fields such as speech visualization and speech synthesis. The corpus covers 24 speakers and about 18k utterances, totalling roughly 20 hours. For each utterance, the audio streams were recorded by two professional microphones, one near-field and one far-field, while a marker-based 3D facial motion capture system with six infrared cameras acquired the 3D video streams; the corresponding 2D video streams were captured by an additional camera as a supplement. A data-processing pipeline is described for synchronizing the audio and video streams, detecting and correcting outliers, and removing head motion from the recordings. Finally, the data-processing results are discussed. To date, this is the largest 3D audio-visual corpus for Mandarin.
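The abstract does not specify how head motion is removed; a standard approach for marker-based capture is rigid Procrustes/Kabsch alignment of a subset of skull-fixed markers to a reference pose, so that only articulatory (non-rigid) facial motion remains. The sketch below illustrates that idea only; the function name, the (T, N, 3) data layout, and the choice of rigid markers are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def remove_head_motion(frames, reference, rigid_idx):
    """Cancel rigid head motion in marker-based 3D facial capture.

    frames:    (T, N, 3) marker positions over T frames
    reference: (N, 3) neutral-pose marker positions
    rigid_idx: indices of markers assumed fixed to the skull
               (e.g. forehead/nose-bridge markers; hypothetical choice)
    """
    aligned = np.empty_like(frames)
    ref = reference[rigid_idx]
    ref_centered = ref - ref.mean(axis=0)
    for t, frame in enumerate(frames):
        src = frame[rigid_idx]
        src_centered = src - src.mean(axis=0)
        # Kabsch algorithm: best-fit rotation from the SVD of the
        # cross-covariance between the frame's and reference's rigid markers.
        H = src_centered.T @ ref_centered
        U, _, Vt = np.linalg.svd(H)
        d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
        trans = ref.mean(axis=0) - R @ src.mean(axis=0)
        # Apply the inverse head pose to every marker in the frame.
        aligned[t] = frame @ R.T + trans
    return aligned
```

The outlier step the abstract mentions would sit upstream of this alignment, for example by flagging markers whose frame-to-frame displacement exceeds a plausibility threshold and interpolating over the gaps.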
Pages: 5
Related papers
50 records in total
  • [21] Audio-Visual Clustering for 3D Speaker Localization
    Khalidov, Vasil
    Forbes, Florence
    Hansard, Miles
    Arnaud, Elise
    Horaud, Radu
    MACHINE LEARNING FOR MULTIMODAL INTERACTION, PROCEEDINGS, 2008, 5237 : 86 - 97
  • [22] Candidate Speech Extraction from Multi-speaker Single-Channel Audio Interviews
    Pandharipande, Meghna
    Kopparapu, Sunil Kumar
    SPEECH AND COMPUTER, SPECOM 2023, PT I, 2023, 14338 : 210 - 221
  • [24] Exploiting the Complementarity of Audio and Visual Data in Multi-Speaker Tracking
    Ban, Yutong
    Girin, Laurent
    Alameda-Pineda, Xavier
    Horaud, Radu
    2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2017), 2017, : 446 - 454
  • [25] The 'Audio-Visual Face Cover Corpus': Investigations into audio-visual speech and speaker recognition when the speaker's face is occluded by facewear
    Fecher, Natalie
    13TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2012 (INTERSPEECH 2012), VOLS 1-3, 2012, : 2247 - 2250
  • [26] DeepMine-multi-TTS: a Persian speech corpus for multi-speaker text-to-speech
    Adibian, Majid
    Zeinali, Hossein
    Barmaki, Soroush
    LANGUAGE RESOURCES AND EVALUATION, 2025,
  • [27] MULTI-SCALE HYBRID FUSION NETWORK FOR MANDARIN AUDIO-VISUAL SPEECH RECOGNITION
    Wang, Jinxin
    Guo, Zhongwen
    Yang, Chao
    Li, Xiaomei
    Cui, Ziyuan
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 642 - 647
  • [28] Multi-speaker DoA Estimation Using Audio and Visual Modality
    Wu, Yulin
    Hu, Ruimin
    Wang, Xiaochen
    Ke, Shanfa
    NEURAL PROCESSING LETTERS, 2023, 55 (07) : 8887 - 8901
  • [29] Multimodal Learning Using 3D Audio-Visual Data for Audio-Visual Speech Recognition
    Su, Rongfeng
    Wang, Lan
    Liu, Xunying
    2017 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2017, : 40 - 43
  • [30] CLeLfPC: a Large Open Multi-Speaker Corpus of French Cued Speech
    Bigi, Brigitte
    Zimmermann, Maryvonne
    Andre, Carine
    LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 987 - 994