A Multi-modal Gesture Recognition System Using Audio, Video, and Skeletal Joint Data

Cited by: 15
Authors
Nandakumar, Karthik [1 ]
Wan, Kong Wah [1]
Chan, Siu Man Alice [1]
Ng, Wen Zheng Terence [1]
Wang, Jian Gang [1]
Yau, Wei Yun [1]
Affiliation
[1] A*STAR, Institute for Infocomm Research (I2R), 1 Fusionopolis Way, Singapore
Keywords
Multi-modal gesture recognition; log-energy features; Mel frequency cepstral coefficients (MFCC); Space-Time Interest Points (STIP); covariance descriptor; Hidden Markov Model (HMM); Support Vector Machine (SVM); fusion; normalization
DOI
10.1145/2522848.2532593
CLC number
TP301 [Theory and Methods]
Discipline code
081202
Abstract
This paper describes the gesture recognition system developed by the Institute for Infocomm Research (I2R) for the 2013 ICMI CHALEARN Multi-modal Gesture Recognition Challenge. The proposed system adopts a multi-modal approach to both detecting and recognizing gestures. Automated gesture detection uses audio signals together with hand-joint information from the Kinect sensor to segment a sample into individual gestures. Once the gestures are detected and segmented, features extracted from three modalities, namely audio, 2-dimensional video (RGB), and skeletal joints (Kinect), are used to classify a given sequence of frames as one of the 20 known gestures or as an unrecognized gesture. Mel frequency cepstral coefficients (MFCC) are extracted from the audio signals and classified with a Hidden Markov Model (HMM). Space-Time Interest Points (STIP) represent the RGB modality, while a covariance descriptor is extracted from the skeletal joint data; for both the RGB and Kinect modalities, Support Vector Machines (SVM) perform the gesture classification. Finally, a fusion scheme accumulates evidence from all three modalities to predict the sequence of gestures in each test sample. The proposed gesture recognition system achieves an average edit distance of 0.2074 over the 275 test samples containing 2,742 unlabeled gestures. While the system recognizes the known gestures with high accuracy, most errors are insertions, which occur when an unrecognized gesture is misclassified as one of the 20 known gestures.
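The score reported above (0.2074) is an edit distance between the predicted and ground-truth gesture label sequences, so an insertion error, such as an unrecognized gesture misclassified as a known one, adds directly to the distance. The sketch below shows the standard Levenshtein computation on label sequences, normalized by the reference length; function names are illustrative, and the exact normalization used by the challenge may differ.

```python
def edit_distance(pred, ref):
    """Levenshtein distance between two gesture label sequences.

    Counts the minimum number of insertions, deletions, and
    substitutions needed to turn `pred` into `ref`.
    """
    m, n = len(pred), len(ref)
    # dp[i][j] = distance between pred[:i] and ref[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of pred[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of ref[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost, # match / substitution
            )
    return dp[m][n]


def normalized_edit_distance(pred, ref):
    """Edit distance normalized by the reference sequence length."""
    return edit_distance(pred, ref) / len(ref)


# A spurious extra gesture (the second 7) is one insertion error:
# edit_distance([3, 7, 7, 12], [3, 7, 12]) == 1, normalized 1/3.
```

Averaging the normalized distance over all test samples yields the challenge score.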
Pages: 475-482
Page count: 8