Cross-Attention Fusion of Visual and Geometric Features for Large-Vocabulary Arabic Lipreading

Cited by: 0

Authors:

Daou, Samar [1]

Ben-Hamadou, Achraf [1,2]

Rekik, Ahmed [1,3]

Kallel, Abdelaziz [1,2]

Affiliations:

[1] Technopk Sfax, SMARTS Lab, Sfax 3021, Tunisia

[2] Technopole Sfax, Digital Res Ctr Sfax, Sfax 3021, Tunisia

[3] Gafsa Univ, ISSAT Inst Super Sci Appl & Technol, Sidi Ahmed Zarrouk Univ Campus, Gafsa 2112, Tunisia

Source:

TECHNOLOGIES | 2025 / Vol. 13 / Issue 01

Keywords:

lipreading; deep learning; LRW-AR; graph neural networks; Transformer; Arabic language;

DOI:

10.3390/technologies13010026

Chinese Library Classification number:

T [Industrial Technology];

Discipline classification code:

08;

Abstract:

Lipreading involves recognizing spoken words by analyzing the movements of the lips and surrounding area using visual data. It is an emerging research topic with many potential applications, such as human-machine interaction and enhancing audio-based speech recognition. Recent deep learning approaches integrate visual features from the mouth region and lip contours. However, simple methods such as concatenation may not effectively optimize the feature vector. In this article, we propose extracting optimal visual features using 3D convolution blocks followed by a ResNet-18, while employing a graph neural network to extract geometric features from tracked lip landmarks. To fuse these complementary features, we introduce a cross-attention mechanism that combines visual and geometric information to obtain an optimal representation of lip movements for lipreading tasks. To validate our approach for Arabic, we introduce the first large-scale Lipreading in the Wild for Arabic (LRW-AR) dataset, consisting of 20,000 videos across 100 word classes, spoken by 36 speakers. Experimental results on both the LRW-AR and LRW datasets demonstrate the effectiveness of our approach, achieving accuracies of 85.85% and 89.41%, respectively.
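The fusion step described in the abstract can be illustrated with a minimal single-head cross-attention sketch in NumPy. This is not the authors' implementation: the direction of attention (visual features as queries, geometric features as keys and values), the feature sizes, the sequence length, and the random projection matrices are all assumptions chosen only to show the mechanics.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(visual, geometric, w_q, w_k, w_v):
    """Single-head cross-attention (illustrative only).

    visual:    (T, d_v) per-frame visual features (assumed to be queries)
    geometric: (T, d_g) per-frame landmark features (assumed keys/values)
    """
    q = visual @ w_q                          # (T, d)
    k = geometric @ w_k                       # (T, d)
    v = geometric @ w_v                       # (T, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T, T) scaled dot products
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights               # fused features, attention map

# Hypothetical sizes: 29 frames, 512-d visual, 64-d geometric, 128-d shared space.
rng = np.random.default_rng(0)
T, d_v, d_g, d = 29, 512, 64, 128
visual = rng.standard_normal((T, d_v))
geometric = rng.standard_normal((T, d_g))

# Random matrices stand in for learned projections.
w_q = rng.standard_normal((d_v, d)) / np.sqrt(d_v)
w_k = rng.standard_normal((d_g, d)) / np.sqrt(d_g)
w_v = rng.standard_normal((d_g, d)) / np.sqrt(d_g)

fused, weights = cross_attention(visual, geometric, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Unlike concatenation, which applies the same fixed combination at every frame, each fused frame here is a weighted mix of geometric features whose weights depend on the visual content of that frame.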


Pages: 22
