Cross-Attention Fusion of Visual and Geometric Features for Large-Vocabulary Arabic Lipreading

Cited by: 0
Authors
Daou, Samar [1 ]
Ben-Hamadou, Achraf [1 ,2 ]
Rekik, Ahmed [1 ,3 ]
Kallel, Abdelaziz [1 ,2 ]
Affiliations
[1] Technopole Sfax, SMARTS Lab, Sfax 3021, Tunisia
[2] Technopole Sfax, Digital Res Ctr Sfax, Sfax 3021, Tunisia
[3] Gafsa Univ, ISSAT Inst Super Sci Appl & Technol, Sidi Ahmed Zarrouk Univ Campus, Gafsa 2112, Tunisia
Keywords
lipreading; deep learning; LRW-AR; graph neural networks; Transformer; Arabic language;
DOI
10.3390/technologies13010026
CLC Number
T [Industrial Technology];
Discipline Classification Code
08;
Abstract
Lipreading involves recognizing spoken words by analyzing the movements of the lips and surrounding area using visual data. It is an emerging research topic with many potential applications, such as human-machine interaction and enhancing audio-based speech recognition. Recent deep learning approaches integrate visual features from the mouth region with lip-contour information; however, simple fusion strategies such as concatenation may not yield an optimal joint feature representation. In this article, we propose extracting visual features using 3D convolution blocks followed by a ResNet-18, while employing a graph neural network to extract geometric features from tracked lip landmarks. To fuse these complementary features, we introduce a cross-attention mechanism that combines visual and geometric information to obtain an optimal representation of lip movements for lipreading tasks. To validate our approach for Arabic, we introduce the first large-scale Lipreading in the Wild for Arabic (LRW-AR) dataset, consisting of 20,000 videos across 100 word classes, spoken by 36 speakers. Experimental results on both the LRW-AR and LRW datasets demonstrate the effectiveness of our approach, achieving accuracies of 85.85% and 89.41%, respectively.
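To make the fusion step described in the abstract concrete, the following is a minimal, illustrative sketch (not the authors' released code) of cross-attention between a visual feature stream and a geometric landmark stream. It assumes both front-ends (3D conv + ResNet-18 for video, a graph neural network for lip landmarks) have already produced per-frame embeddings of equal dimension; the class name, embedding size, head count, and the residual/normalization layout are assumptions for illustration only.

```python
# Illustrative sketch of cross-attention fusion of visual and geometric
# per-frame features (hypothetical layout, not the authors' implementation).
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Fuses two temporally aligned feature streams with cross-attention."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Each stream queries the other stream's features.
        self.vis_to_geo = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.geo_to_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_g = nn.LayerNorm(dim)
        self.proj = nn.Linear(2 * dim, dim)  # merge the two attended streams

    def forward(self, visual: torch.Tensor, geometric: torch.Tensor) -> torch.Tensor:
        # visual, geometric: (batch, time, dim) per-frame embeddings.
        v_att, _ = self.vis_to_geo(query=visual, key=geometric, value=geometric)
        g_att, _ = self.geo_to_vis(query=geometric, key=visual, value=visual)
        v = self.norm_v(visual + v_att)      # residual connection + normalization
        g = self.norm_g(geometric + g_att)
        return self.proj(torch.cat([v, g], dim=-1))  # fused (batch, time, dim)


if __name__ == "__main__":
    fusion = CrossAttentionFusion(dim=512)
    vis = torch.randn(2, 29, 512)   # e.g. 29-frame clips, as in LRW-style data
    geo = torch.randn(2, 29, 512)
    print(fusion(vis, geo).shape)   # torch.Size([2, 29, 512])
```

The fused sequence would then feed a temporal back-end (e.g., a Transformer) and a word-level classifier; those components are outside the scope of this sketch.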
Pages: 22