MAVD: The First Open Large-Scale Mandarin Audio-Visual Dataset with Depth Information

Cited by: 2
Authors
Wang, Jianrong [1]
Huo, Yuchen [2]
Liu, Li [3]
Xu, Tianyi [1]
Li, Qi [4]
Li, Sen [1]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin, Peoples R China
[2] Tianjin Univ, Tianjin Int Engn Inst, Tianjin, Peoples R China
[3] Hong Kong Univ Sci & Technol Guangzhou, Guangzhou, Peoples R China
[4] Tianjin Univ, Sch Elect & Informat Engn, Tianjin, Peoples R China
Source
Funding
National Natural Science Foundation of China
Keywords
Audio-Visual Speech Recognition; Mandarin Audio-Visual Corpus; Azure Kinect; Depth Information; SPEECH; RECOGNITION; TECHNOLOGY
DOI
10.21437/Interspeech.2023-823
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Audio-visual speech recognition (AVSR) has gained increasing attention from researchers as an important part of human-computer interaction. However, existing Mandarin audio-visual datasets are limited and lack depth information. To address this issue, this work establishes MAVD, a new large-scale Mandarin multimodal corpus comprising 12,484 utterances spoken by 64 native Chinese speakers. To ensure the dataset covers diverse real-world scenarios, a pipeline for cleaning and filtering the raw text material has been developed to create well-balanced reading material. In particular, Microsoft's latest data acquisition device, Azure Kinect, is used to capture depth information in addition to the traditional audio signals and RGB images during data acquisition. We also provide a baseline experiment, which can be used to evaluate the effectiveness of the dataset. The dataset and code will be released at https://github.com/SpringHuo/MAVD.
Pages: 2113-2117
Number of pages: 5