A Multi-modal Gesture Recognition System Using Audio, Video, and Skeletal Joint Data

Cited by: 15
Authors
Nandakumar, Karthik [1 ]
Wah, Wan Kong [1 ]
Alice, Chan Siu Man [1 ]
Terence, Ng Wen Zheng [1 ]
Gang, Wang Jian [1 ]
Yun, Yau Wei [1 ]
Affiliation
[1] A*STAR, Institute for Infocomm Research (I2R), 1 Fusionopolis Way, Singapore, Singapore
Keywords
Multi-modal gesture recognition; log-energy features; Mel frequency cepstral coefficients (MFCC); Space-Time Interest Points (STIP); covariance descriptor; Hidden Markov Model (HMM); Support Vector Machine (SVM); fusion; normalization
DOI
10.1145/2522848.2532593
CLC number
TP301 [Theory, Methods]
Subject classification code
081202
Abstract
This paper describes the gesture recognition system developed by the Institute for Infocomm Research (I2R) for the 2013 ICMI CHALEARN Multi-modal Gesture Recognition Challenge. The proposed system adopts a multi-modal approach for both detecting and recognizing gestures. Automated gesture detection is performed using both audio signals and hand-joint information obtained from the Kinect sensor to segment a sample into individual gestures. Once the gestures are detected and segmented, features extracted from three different modalities, namely, audio, 2-dimensional video (RGB), and skeletal joints (Kinect), are used to classify a given sequence of frames as one of the 20 known gestures or an unrecognized gesture. Mel frequency cepstral coefficients (MFCC) are extracted from the audio signals and a Hidden Markov Model (HMM) is used for classification. While Space-Time Interest Points (STIP) are used to represent the RGB modality, a covariance descriptor is extracted from the skeletal joint data. For both the RGB and Kinect modalities, Support Vector Machines (SVM) are used for gesture classification. Finally, a fusion scheme is applied to accumulate evidence from all three modalities and predict the sequence of gestures in each test sample. The proposed gesture recognition system achieves an average edit distance of 0.2074 over the 275 test samples containing 2,742 unlabeled gestures. While the proposed system recognizes the known gestures with high accuracy, most of the errors are insertion errors, which occur when an unrecognized gesture is misclassified as one of the 20 known gestures.
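The evaluation metric reported above, the average edit distance per test sample, can be sketched as the standard Levenshtein distance between the predicted and ground-truth gesture label sequences, normalized by the ground-truth length. This is a generic illustration of the metric, not code from the paper; all function and variable names are hypothetical.

```python
def edit_distance(pred, truth):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn the predicted label sequence into the ground truth."""
    m, n = len(pred), len(truth)
    # dp[i][j] = edit distance between pred[:i] and truth[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all remaining predicted labels
    for j in range(n + 1):
        dp[0][j] = j  # insert all remaining ground-truth labels
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == truth[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def normalized_edit_distance(pred, truth):
    """Edit distance normalized by ground-truth sequence length."""
    return edit_distance(pred, truth) / max(len(truth), 1)
```

Under this metric, an "insertion error" of the kind the abstract describes, a spurious extra gesture in the prediction, adds one deletion operation to the distance, which is why misclassifying unrecognized gestures as known ones directly inflates the score.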
Pages: 475 - 482
Page count: 8
Related papers
50 items total
  • [1] Multi-modal Gesture Recognition Using Integrated Model of Motion, Audio and Video
    Goutsu, Yusuke
    Kobayashi, Takaki
    Obara, Junya
    Kusajima, Ikuo
    Takeichi, Kazunari
    Takano, Wataru
    Nakamura, Yoshihiko
    Chinese Journal of Mechanical Engineering, 2015, 28 (04) : 657 - 665
  • [2] Erratum to: Multi-modal Gesture Recognition Using Integrated Model of Motion, Audio and Video
    Goutsu, Yusuke
    Kobayashi, Takaki
    Obara, Junya
    Kusajima, Ikuo
    Takeichi, Kazunari
    Takano, Wataru
    Nakamura, Yoshihiko
    Chinese Journal of Mechanical Engineering, 2017, 30 (06) : 1473 - 1473
  • [3] A Multi Modal Approach to Gesture Recognition from Audio and Video Data
    Bayer, Immanuel
    Silbermann, Thierry
    ICMI'13: Proceedings of the 2013 ACM International Conference on Multimodal Interaction, 2013 : 461 - 465
  • [4] Multi-Modal Emotion Recognition Fusing Video and Audio
    Xu, Chao
    Du, Pufeng
    Feng, Zhiyong
    Meng, Zhaopeng
    Cao, Tianyi
    Dong, Caichao
    Applied Mathematics & Information Sciences, 2013, 7 (02) : 455 - 462
  • [5] Multi-modal Gesture Recognition Using Skeletal Joints and Motion Trail Model
    Liang, Bin
    Zheng, Lihong
    Computer Vision - ECCV 2014 Workshops, Pt I, 2015, 8925 : 623 - 638
  • [6] Multi-modal Learning for Gesture Recognition
    Cao, Congqi
    Zhang, Yifan
    Lu, Hanqing
    2015 IEEE International Conference on Multimedia & Expo (ICME), 2015