Improvement of Acoustic Models Fused with Lip Visual Information for Low-Resource Speech

Cited: 4
Authors
Yu, Chongchong [1 ]
Yu, Jiaqi [1 ]
Qian, Zhaopeng [1 ]
Tan, Yuchen [1 ]
Affiliations
[1] Beijing Technol & Business Univ, Sch Artificial Intelligence, Beijing 100048, Peoples R China
Keywords
audiovisual speech recognition; low-resource language; automatic speech recognition; lipreading; AUDIOVISUAL FUSION; RECOGNITION; LANGUAGE; ADAPTATION; ASR;
DOI
10.3390/s23042071
Chinese Library Classification
O65 [Analytical Chemistry];
Discipline Codes
070302; 081704
Abstract
Endangered languages, as intangible cultural resources that cannot be renewed, are typically low-resource. Automatic speech recognition (ASR) is an effective means of preserving such languages. However, for a low-resource language, native speakers are few and labeled corpora are scarce, so ASR suffers from high speaker dependence and overfitting, which greatly harms recognition accuracy. To address these deficiencies, this paper proposes an audiovisual speech recognition (AVSR) approach based on an LSTM-Transformer. The approach introduces visual information, including lip movements, to reduce the acoustic model's dependence on particular speakers and on the quantity of training data. Specifically, by fusing audio and visual information, the approach enriches the representation of the speakers' feature space, achieving a degree of speaker adaptation that is difficult to obtain from a single modality. The approach also includes speaker-dependence experiments that evaluate to what extent audiovisual fusion depends on speakers. Experimental results show that the character error rate (CER) of AVSR is 16.9% lower than that of traditional models in the best-performing scenario, and 11.8% lower than that of lip reading alone. The accuracy of phoneme recognition, especially for finals, improves substantially. For initials, accuracy improves for affricates and fricatives, where lip movements are visually salient, and deteriorates for stops, where they are not. AVSR also generalizes to different speakers better than a single modality, with a CER reduction of up to 17.2%. AVSR is therefore of great significance for protecting and preserving endangered languages through AI.
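The abstract describes fusing lip-movement features with acoustic features before recognition. A common prerequisite for such feature-level fusion is aligning the two streams, which run at different frame rates. The sketch below illustrates this step only; the frame rates, feature dimensions, and the function name are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def fuse_audio_visual(audio_feats: np.ndarray, visual_feats: np.ndarray) -> np.ndarray:
    """Early (feature-level) audiovisual fusion sketch.

    audio_feats:  (T_a, D_a) acoustic frames, e.g. filterbanks at 100 fps.
    visual_feats: (T_v, D_v) lip-region embeddings, e.g. video at 25 fps.

    The visual stream is upsampled by nearest-frame repetition to the
    audio frame rate, then the two streams are concatenated per frame.
    """
    t_a = audio_feats.shape[0]
    t_v = visual_feats.shape[0]
    # Map each audio frame index to the nearest earlier visual frame index.
    idx = np.minimum((np.arange(t_a) * t_v) // t_a, t_v - 1)
    visual_up = visual_feats[idx]  # shape (T_a, D_v)
    return np.concatenate([audio_feats, visual_up], axis=1)  # (T_a, D_a + D_v)

# Illustrative shapes: 1 s of 80-dim audio at 100 fps, 256-dim video at 25 fps.
fused = fuse_audio_visual(np.zeros((100, 80)), np.zeros((25, 256)))
print(fused.shape)  # (100, 336)
```

The fused frame sequence could then be fed to a shared encoder such as the LSTM-Transformer named in the abstract; that model itself is not reproduced here.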
Pages: 19