Cross-Attention Fusion of Visual and Geometric Features for Large-Vocabulary Arabic Lipreading

Authors
Daou, Samar [1]
Ben-Hamadou, Achraf [1, 2]
Rekik, Ahmed [1, 3]
Kallel, Abdelaziz [1, 2]
Affiliations
[1] Technopk Sfax, SMARTS Lab, Sfax 3021, Tunisia
[2] Technopole Sfax, Digital Res Ctr Sfax, Sfax 3021, Tunisia
[3] Gafsa Univ, ISSAT Inst Super Sci Appl & Technol, Sidi Ahmed Zarrouk Univ Campus, Gafsa 2112, Tunisia
Keywords
lipreading; deep learning; LRW-AR; graph neural networks; Transformer; Arabic language
DOI
10.3390/technologies13010026
Chinese Library Classification
T [Industrial Technology]
Discipline classification code
08
Abstract
Lipreading involves recognizing spoken words by analyzing the movements of the lips and surrounding area using visual data. It is an emerging research topic with many potential applications, such as human-machine interaction and enhancing audio-based speech recognition. Recent deep learning approaches integrate visual features from the mouth region and lip contours. However, simple methods such as concatenation may not effectively optimize the feature vector. In this article, we propose extracting optimal visual features using 3D convolution blocks followed by a ResNet-18, while employing a graph neural network to extract geometric features from tracked lip landmarks. To fuse these complementary features, we introduce a cross-attention mechanism that combines visual and geometric information to obtain an optimal representation of lip movements for lipreading tasks. To validate our approach for Arabic, we introduce the first large-scale Lipreading in the Wild for Arabic (LRW-AR) dataset, consisting of 20,000 videos across 100 word classes, spoken by 36 speakers. Experimental results on both the LRW-AR and LRW datasets demonstrate the effectiveness of our approach, achieving accuracies of 85.85% and 89.41%, respectively.
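The fusion step described in the abstract can be illustrated with a minimal sketch of scaled dot-product cross-attention, in which per-frame visual features act as queries and geometric (landmark) features act as keys and values. This is not the authors' implementation: the function names, the absence of learned projection matrices, the single attention head, and the toy feature values are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (hypothetical helper).

    queries: list of vectors from one modality (here: visual features).
    keys, values: lists of vectors from the other modality (here: geometric
    features). Learned projections are omitted; this is an assumption.
    """
    dim = len(keys[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        attended.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return attended


if __name__ == "__main__":
    # Toy per-frame features (made-up numbers, two frames, two dimensions).
    visual = [[1.0, 0.0], [0.0, 1.0]]
    geometric = [[0.5, 0.5], [1.0, -1.0]]
    for frame in cross_attention(visual, geometric, geometric):
        print(frame)
```

Each output row is a geometry-informed vector for the corresponding visual frame. A full model would typically add learned query, key, and value projections and multiple heads around this core operation.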
Pages: 22