Cross-modal knowledge distillation for continuous sign language recognition

Cited: 0
Authors
Gao, Liqing [1 ]
Shi, Peng [1 ]
Hu, Lianyu [1 ]
Feng, Jichao [1 ]
Zhu, Lei [2 ]
Wan, Liang [1 ]
Feng, Wei [1 ]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Sch Comp Sci & Technol, Tianjin 300350, Peoples R China
[2] Hong Kong Univ Sci & Technol Guangzhou, Guangzhou, Peoples R China
Keywords
Sign language recognition; Knowledge distillation; Cross-modal; Attention mechanism;
DOI
10.1016/j.neunet.2024.106587
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Continuous Sign Language Recognition (CSLR) is a task that converts a sign language video into a gloss sequence. Existing deep-learning-based sign language recognition methods usually rely on large-scale training data and rich supervision. However, current sign language datasets are limited and are annotated only at the sentence level rather than the frame level. Such inadequate supervision poses a serious challenge for sign language recognition and may result in insufficiently trained recognition models. To address the above problems, we propose a cross-modal knowledge distillation method for continuous sign language recognition, which contains two teacher models and one student model. One teacher is the Sign2Text dialogue teacher model, which takes a sign language video and a dialogue sentence as input and outputs the sign language recognition result. The other teacher is the Text2Gloss translation teacher model, which aims to translate a text sentence into a gloss sequence. Both teacher models provide information-rich soft labels to assist the training of the student model, which is a general sign language recognition model. We conduct extensive experiments on multiple commonly used sign language datasets, i.e., PHOENIX-2014T, CSL-Daily and QSL; the results show that the proposed cross-modal knowledge distillation method can effectively improve sign language recognition accuracy by transferring multi-modal information from the teacher models to the student model. Code is available at https://github.com/glq-1992/cross-modal-knowledge-distillation_new.
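The core idea in the abstract, training a student against soft labels from two teachers, can be illustrated with a minimal, hypothetical sketch. This is not the authors' implementation; all function names, the temperature `T`, and the mixing weight `alpha` are illustrative assumptions about how temperature-scaled soft labels from a Sign2Text and a Text2Gloss teacher might be combined into one distillation loss.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T yields softer label distributions.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q), averaged over the batch dimension.
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def distillation_loss(student_logits, sign2text_logits, text2gloss_logits,
                      T=2.0, alpha=0.5):
    """Combine soft labels from two teachers into one distillation loss.

    alpha weights the Sign2Text teacher against the Text2Gloss teacher;
    both weights and the temperature are placeholder hyperparameters.
    """
    p_s2t = softmax(sign2text_logits, T)   # soft labels, Sign2Text teacher
    p_t2g = softmax(text2gloss_logits, T)  # soft labels, Text2Gloss teacher
    q = softmax(student_logits, T)         # student's predicted distribution
    return alpha * kl_divergence(p_s2t, q) + (1 - alpha) * kl_divergence(p_t2g, q)
```

In practice this term would be added to the student's supervised (e.g. CTC) loss; when the student's distribution matches both teachers exactly, the distillation term vanishes.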
Pages: 13
Related Papers
50 records
  • [11] CROSS-MODAL KNOWLEDGE DISTILLATION FOR VISION-TO-SENSOR ACTION RECOGNITION
    Ni, Jianyuan
    Sarbajna, Raunak
    Liu, Yang
    Ngu, Anne H. H.
    Yan, Yan
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4448 - 4452
  • [12] FedCMD: A Federated Cross-modal Knowledge Distillation for Drivers' Emotion Recognition
    Bano, Saira
    Tonellotto, Nicola
    Cassara, Pietro
    Gotta, Alberto
    ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY, 2024, 15 (03)
  • [13] C2ST: Cross-modal Contextualized Sequence Transduction for Continuous Sign Language Recognition
    Zhang, Huaiwen
    Guo, Zihang
    Yang, Yang
    Liu, Xin
    Hu, De
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 20996 - 21005
  • [14] Cross-modal Neural Sign Language Translation
    Duarte, Amanda
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 1650 - 1654
  • [15] A Sign Language Recognition Framework Based on Cross-Modal Complementary Information Fusion
    Zhang, Jiangtao
    Wang, Qingshan
    Wang, Qi
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 8131 - 8144
  • [16] Acoustic NLOS Imaging with Cross-Modal Knowledge Distillation
    Shin, Ui-Hyeon
    Jang, Seungwoo
    Kim, Kwangsu
    PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023, : 1405 - 1413
  • [17] EmotionKD: A Cross-Modal Knowledge Distillation Framework for Emotion Recognition Based on Physiological Signals
    Liu, Yucheng
    Jia, Ziyu
    Wang, Haichao
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 6122 - 6131
  • [18] Cross-Modal Attention Network for Sign Language Translation
    Gao, Liqing
    Wan, Liang
    Feng, Wei
    2023 23RD IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS, ICDMW 2023, 2023, : 985 - 994
  • [19] Continuous Sign Language Recognition Through Cross-Modal Alignment of Video and Text Embeddings in a Joint-Latent Space
    Papastratis, Ilias
    Dimitropoulos, Kosmas
    Konstantinidis, Dimitrios
    Daras, Petros
    IEEE ACCESS, 2020, 8 : 91170 - 91180
  • [20] Unsupervised Deep Cross-Modal Hashing by Knowledge Distillation for Large-scale Cross-modal Retrieval
    Li, Mingyong
    Wang, Hongya
    PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 183 - 191