Improvement of Acoustic Models Fused with Lip Visual Information for Low-Resource Speech

Cited: 4
Authors
Yu, Chongchong [1 ]
Yu, Jiaqi [1 ]
Qian, Zhaopeng [1 ]
Tan, Yuchen [1 ]
Affiliations
[1] Beijing Technol & Business Univ, Sch Artificial Intelligence, Beijing 100048, Peoples R China
Keywords
audiovisual speech recognition; low-resource language; automatic speech recognition; lipreading; AUDIOVISUAL FUSION; RECOGNITION; LANGUAGE; ADAPTATION; ASR;
DOI
10.3390/s23042071
Chinese Library Classification
O65 [Analytical Chemistry];
Discipline Codes
070302; 081704
Abstract
Endangered languages, as intangible cultural resources that cannot be renewed, are typically low-resource. Automatic speech recognition (ASR) is an effective means of preserving such languages. However, for a low-resource language, native speakers are few and labeled corpora are scarce, so ASR suffers from high speaker dependence and overfitting, which greatly harms recognition accuracy. To address these deficiencies, this paper proposes an audiovisual speech recognition (AVSR) approach based on an LSTM-Transformer. The approach introduces visual information, including lip movements, to reduce the acoustic model's dependence on particular speakers and on the quantity of training data. Specifically, by fusing audio and visual information, the approach enriches the representation of the speakers' feature space, achieving a degree of speaker adaptation that is difficult to obtain from a single modality. The approach also includes speaker-dependence experiments that evaluate to what extent audiovisual fusion depends on speakers. Experimental results show that the character error rate (CER) of AVSR is 16.9% lower than that of traditional models in the best-performing scenario, and 11.8% lower than that of lip reading alone. The accuracy of phoneme recognition, especially for finals, improves substantially. For initials, accuracy improves for affricates and fricatives, where lip movements are visually salient, and deteriorates for stops, where they are not. AVSR also generalizes to different speakers better than a single modality, with a CER reduction of up to 17.2%. AVSR is therefore of great significance for protecting and preserving endangered languages through AI.
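The abstract describes fusing lip-movement features with acoustic features before recognition. A common prerequisite for such feature-level fusion is aligning the two streams, which run at different frame rates. The sketch below illustrates this step only; the frame rates, feature dimensions, and the function name are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def fuse_audio_visual(audio_feats: np.ndarray, visual_feats: np.ndarray) -> np.ndarray:
    """Early (feature-level) audiovisual fusion sketch.

    audio_feats:  (T_a, D_a) acoustic frames, e.g. filterbanks at 100 fps.
    visual_feats: (T_v, D_v) lip-region embeddings, e.g. video at 25 fps.

    The visual stream is upsampled by nearest-frame repetition to the
    audio frame rate, then the two streams are concatenated per frame.
    """
    t_a = audio_feats.shape[0]
    t_v = visual_feats.shape[0]
    # Map each audio frame index to the nearest earlier visual frame index.
    idx = np.minimum((np.arange(t_a) * t_v) // t_a, t_v - 1)
    visual_up = visual_feats[idx]  # shape (T_a, D_v)
    return np.concatenate([audio_feats, visual_up], axis=1)  # (T_a, D_a + D_v)

# Illustrative shapes: 1 s of 80-dim audio at 100 fps, 256-dim video at 25 fps.
fused = fuse_audio_visual(np.zeros((100, 80)), np.zeros((25, 256)))
print(fused.shape)  # (100, 336)
```

The fused frame sequence could then be fed to a shared encoder such as the LSTM-Transformer named in the abstract; that model itself is not reproduced here.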
Pages: 19