Cross-Attention Fusion of Visual and Geometric Features for Large-Vocabulary Arabic Lipreading

Cited by: 0
Authors
Daou, Samar [1]
Ben-Hamadou, Achraf [1,2]
Rekik, Ahmed [1,3]
Kallel, Abdelaziz [1,2]
Affiliations
[1] Technopole Sfax, SMARTS Lab, Sfax 3021, Tunisia
[2] Technopole Sfax, Digital Res Ctr Sfax, Sfax 3021, Tunisia
[3] Gafsa Univ, ISSAT Inst Super Sci Appl & Technol, Sidi Ahmed Zarrouk Univ Campus, Gafsa 2112, Tunisia
Keywords
lipreading; deep learning; LRW-AR; graph neural networks; Transformer; Arabic language
DOI
10.3390/technologies13010026
Chinese Library Classification
T [Industrial Technology]
Discipline classification code
08
Abstract
Lipreading involves recognizing spoken words by analyzing the movements of the lips and surrounding area using visual data. It is an emerging research topic with many potential applications, such as human-machine interaction and enhancing audio-based speech recognition. Recent deep learning approaches integrate visual features from the mouth region and lip contours. However, simple methods such as concatenation may not effectively optimize the feature vector. In this article, we propose extracting optimal visual features using 3D convolution blocks followed by a ResNet-18, while employing a graph neural network to extract geometric features from tracked lip landmarks. To fuse these complementary features, we introduce a cross-attention mechanism that combines visual and geometric information to obtain an optimal representation of lip movements for lipreading tasks. To validate our approach for Arabic, we introduce the first large-scale Lipreading in the Wild for Arabic (LRW-AR) dataset, consisting of 20,000 videos across 100 word classes, spoken by 36 speakers. Experimental results on both the LRW-AR and LRW datasets demonstrate the effectiveness of our approach, achieving accuracies of 85.85% and 89.41%, respectively.
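The abstract does not give the fusion equations, so the following is only a minimal sketch of single-head cross-attention between two per-frame feature sequences, written in dependency-free Python. It assumes that the visual features act as queries and the geometric features supply keys and values; that direction, the feature sizes, the random projection matrices, and all function names are hypothetical and are not taken from the paper.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(visual, geometric, w_q, w_k, w_v):
    """Single-head cross-attention: visual frames query geometric frames.

    Assumption: queries come from the visual stream, keys and values from
    the geometric stream. The paper may arrange this differently.
    """
    q = matmul(visual, w_q)       # (T, d) queries
    k = matmul(geometric, w_k)    # (T, d) keys
    v = matmul(geometric, w_v)    # (T, d) values
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)       # (T, T) frame-to-frame similarity
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    attended = matmul(weights, v)  # (T, d) geometry summarized per visual frame
    return attended, weights


def random_matrix(rows, cols, rng):
    """Stand-in for learned features or weights (illustrative only)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T, D_VIS, D_GEO, D = 5, 8, 6, 4   # hypothetical frame count and feature sizes
visual = random_matrix(T, D_VIS, rng)      # e.g. per-frame CNN features
geometric = random_matrix(T, D_GEO, rng)   # e.g. per-frame landmark-graph features
w_q = random_matrix(D_VIS, D, rng)
w_k = random_matrix(D_GEO, D, rng)
w_v = random_matrix(D_GEO, D, rng)

fused, weights = cross_attention(visual, geometric, w_q, w_k, w_v)
print(len(fused), len(fused[0]))  # one fused d-dimensional vector per frame
```

Unlike plain concatenation, each output frame here is a weighted mix of geometric features, with weights that depend on the visual content of that frame. A practical model would learn the projections and typically use several heads.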
Pages: 22
Related papers
50 items in total
  • [21] Audio-Visual Cross-Attention Network for Robotic Speaker Tracking
    Qian, Xinyuan
    Wang, Zhengdong
    Wang, Jiadong
    Guan, Guohui
    Li, Haizhou
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31 : 550 - 562
  • [22] Multi-Granularity Cross-Attention Network for Visual Question Answering
    Wang, Yue
    Gao, Wei
    Cheng, Xinzhou
    Wang, Xin
    Zhao, Huiying
    Xie, Zhipu
    Xu, Lexi
    2023 IEEE 22ND INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS, TRUSTCOM, BIGDATASE, CSE, EUC, ISCI 2023, 2024, : 2098 - 2103
  • [23] Adaptive Multi-Feature Fusion Visual Target Tracking Based on Siamese Neural Network with Cross-Attention Mechanism
    Zhou, Qian
    Xia, Haoran
    Yan, Hongzheng
    Yang, Ming
    Chen, Shidong
    2022 22ND IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND INTERNET COMPUTING (CCGRID 2022), 2022, : 307 - 316
  • [24] CAFE: A Cross-Attention Based Adaptive Weighting Fusion Network for MODIS and Landsat Spatiotemporal Fusion
    Lin, Liupeng
    Shen, Yao
    Wu, Jingan
    Nan, Fang
    IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2023, 20
  • [25] MSER: Multimodal speech emotion recognition using cross-attention with deep fusion
    Khan, Mustaqeem
    Gueaieb, Wail
    El Saddik, Abdulmotaleb
    Kwon, Soonil
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 245
  • [26] Multimodal Dual Cross-Attention Fusion Strategy for Autonomous Garbage Classification System
    Xu, Huxiu
    Tang, Wei
    Li, Zhaoyang
    Qin, Kecheng
    Zou, Jun
    IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, 2024, 20 (11) : 13319 - 13329
  • [27] Enhancing Emotion Recognition in Speech Based on Self-Supervised Learning: Cross-Attention Fusion of Acoustic and Semantic Features
    Deeb, Bashar M.
    Savchenko, Andrey V.
    Makarov, Ilya
    IEEE ACCESS, 2025, 13 : 56283 - 56295
  • [28] Bridging CNN and Transformer With Cross-Attention Fusion Network for Hyperspectral Image Classification
    Xu, Fulin
    Mei, Shaohui
    Zhang, Ge
    Wang, Nan
    Du, Qian
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62
  • [29] Spatial Cross-Attention RGB-D Fusion Module for Object Detection
    Gao, Shangyin
    Markhasin, Lev
    Wang, Bi
    IEEE MMSP 2021: 2021 IEEE 23RD INTERNATIONAL WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING (MMSP), 2021
  • [30] CaEGCN: Cross-Attention Fusion Based Enhanced Graph Convolutional Network for Clustering
    Huo, Guangyu
    Zhang, Yong
    Gao, Junbin
    Wang, Boyue
    Hu, Yongli
    Yin, Baocai
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2023, 35 (04) : 3471 - 3483