A Multi-modal Gesture Recognition System Using Audio, Video, and Skeletal Joint Data

Cited by: 15
Authors
Nandakumar, Karthik [1 ]
Wah, Wan Kong [1 ]
Alice, Chan Siu Man [1 ]
Terence, Ng Wen Zheng [1 ]
Gang, Wang Jian [1 ]
Yun, Yau Wei [1 ]
Affiliations
[1] A*STAR, Institute for Infocomm Research (I2R), 1 Fusionopolis Way, Singapore
Keywords
Multi-modal gesture recognition; log-energy features; Mel frequency cepstral coefficients (MFCC); Space-Time Interest Points (STIP); covariance descriptor; Hidden Markov Model (HMM); Support Vector Machine (SVM); fusion; normalization
DOI
10.1145/2522848.2532593
CLC number
TP301 [Theory, Methods]
Discipline code
081202
Abstract
This paper describes the gesture recognition system developed by the Institute for Infocomm Research (I2R) for the 2013 ICMI CHALEARN Multi-modal Gesture Recognition Challenge. The proposed system adopts a multi-modal approach for detecting as well as recognizing the gestures. Automated gesture detection is performed using both audio signals and information about hand joints obtained from the Kinect sensor to segment a sample into individual gestures. Once the gestures are detected and segmented, features extracted from three different modalities, namely, audio, 2-dimensional video (RGB), and skeletal joints (Kinect), are used to classify a given sequence of frames into one of the 20 known gestures or an unrecognized gesture. Mel frequency cepstral coefficients (MFCC) are extracted from the audio signals and a Hidden Markov Model (HMM) is used for classification. While Space-Time Interest Points (STIP) are used to represent the RGB modality, a covariance descriptor is extracted from the skeletal joint data. For both the RGB and Kinect modalities, Support Vector Machines (SVM) are used for gesture classification. Finally, a fusion scheme is applied to accumulate evidence from all three modalities and predict the sequence of gestures in each test sample. The proposed gesture recognition system achieves an average edit distance of 0.2074 over the 275 test samples containing 2,742 unlabeled gestures. While the proposed system recognizes the known gestures with high accuracy, most of the errors are caused by insertions, which occur when an unrecognized gesture is misclassified as one of the 20 known gestures.
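The evaluation metric reported above, the average edit distance between predicted and ground-truth gesture label sequences, can be sketched as a standard Levenshtein distance over sequences of gesture IDs, normalized by the length of the ground-truth sequence. The function names and the normalization convention here are assumptions for illustration, not code from the paper.

```python
def edit_distance(pred, truth):
    """Levenshtein distance between two gesture label sequences,
    computed with classic dynamic programming."""
    m, n = len(pred), len(truth)
    # dp[i][j] = edit distance between pred[:i] and truth[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all of pred[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all of truth[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == truth[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[m][n]

def normalized_edit_distance(pred, truth):
    """Edit distance divided by the ground-truth length; averaging this
    over all test samples yields a score comparable to the one reported."""
    return edit_distance(pred, truth) / len(truth)
```

Under this metric, an insertion error (an unrecognized gesture labeled as one of the 20 known classes) contributes one unit of distance, which is why the abstract singles insertions out as the dominant error source.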
Pages: 475 - 482
Page count: 8
Related Papers
50 records
  • [31] Bayesian Co-Boosting for Multi-modal Gesture Recognition
    Wu, Jiaxiang
    Cheng, Jian
    JOURNAL OF MACHINE LEARNING RESEARCH, 2014, 15 : 3013 - 3036
  • [33] Multi-modal Gesture Recognition Challenge 2013: Dataset and Results
    Escalera, Sergio
    Gonzalez, Jordi
    Baro, Xavier
    Reyes, Miguel
    Lopes, Oscar
    Guyon, Isabelle
    Athitsos, Vassilis
    Escalante, Hugo J.
    ICMI'13: PROCEEDINGS OF THE 2013 ACM INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, 2013, : 445 - 452
  • [34] Mudra: A Multi-Modal Smartwatch Interactive System with Hand Gesture Recognition and User Identification
    Guo, Kaiwen
    Zhou, Hao
    Tian, Ye
    Zhou, Wangqiu
    Ji, Yusheng
    Li, Xiang-Yang
    IEEE CONFERENCE ON COMPUTER COMMUNICATIONS (IEEE INFOCOM 2022), 2022, : 100 - 109
  • [35] Nonparametric Feature Matching Based Conditional Random Fields for Gesture Recognition from Multi-Modal Video
    Chang, Ju Yong
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2016, 38 (08) : 1612 - 1625
  • [36] An enhanced artificial neural network for hand gesture recognition using multi-modal features
    Uke, Shailaja N.
    Zade, Amol V.
    COMPUTER METHODS IN BIOMECHANICS AND BIOMEDICAL ENGINEERING-IMAGING AND VISUALIZATION, 2023, 11 (06): : 2278 - 2289
  • [37] Multi-Task and Multi-Modal Learning for RGB Dynamic Gesture Recognition
    Fan, Dinghao
    Lu, Hengjie
    Xu, Shugong
    Cao, Shan
    IEEE SENSORS JOURNAL, 2021, 21 (23) : 27026 - 27036
  • [38] A comprehensive video dataset for multi-modal recognition systems
    Handa, A.
    Agarwal, R.
    Kohli, N.
    Data Science Journal, 2019, 18 (01):
  • [39] A Multi-modal System for Video Semantic Understanding
    Lv, Zhengwei
    Lei, Tao
    Liang, Xiao
    Shi, Zhizhong
    Liu, Duoxing
    CCKS 2021 - EVALUATION TRACK, 2022, 1553 : 34 - 43
  • [40] Adaptive cross-fusion learning for multi-modal gesture recognition
    Zhou, Benjia
    Wan, Jun
    Liang, Yanyan
    Guo, Guodong
    Virtual Reality and Intelligent Hardware, 2021, 3 (03): : 235 - 247