Object category detection using audio-visual cues

Times cited: 0
Authors
Luo, Jie [1, 2]
Caputo, Barbara [1, 2]
Zweig, Alon [3]
Bach, Joerg-Hendrik [4]
Anemueller, Joern [4]
Affiliations
[1] IDIAP Res Inst, Ctr Parc, CH-1920 Martigny, Switzerland
[2] Swiss Fed Inst Technol, Lausanne, Switzerland
[3] Hebrew Univ Jerusalem, Jerusalem, Israel
[4] Carl von Ossietzky Univ Oldenburg, Oldenburg, Germany
Keywords
object categorization; multimodal recognition; audio-visual fusion
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Categorization is one of the fundamental building blocks of cognitive systems. Object categorization has traditionally been addressed in the vision domain, even though cognitive agents are intrinsically multimodal. Indeed, biological systems combine several modalities to achieve robust categorization. In this paper we propose a multimodal approach to object category detection that uses audio and visual information. The auditory channel is modeled with a discriminative classifier over biologically motivated spectral features. The visual channel is modeled with a state-of-the-art part-based model. The two modalities are combined using two fusion schemes, one high-level and one low-level. Experiments on six object categories, under increasingly difficult conditions, reveal the strengths and weaknesses of the two approaches and clearly underline the open challenges for multimodal category detection.
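The abstract does not say how the two fusion schemes are built. The sketch below is a minimal illustration under a loudly stated assumption: "high-level" fusion is taken to mean decision-level fusion (combining the scores of separately run audio and visual classifiers), and "low-level" fusion is taken to mean feature-level fusion (concatenating the audio and visual features before a single classifier). The linear scorers, the weights, the mixing parameter `alpha`, the threshold, and every function name are hypothetical; this is not the authors' method, and in the paper the visual channel is a part-based model rather than a linear scorer.

```python
from typing import Sequence


def linear_score(weights: Sequence[float], features: Sequence[float], bias: float = 0.0) -> float:
    """Score of a linear classifier, w . x + b (hypothetical stand-in for any per-channel detector)."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * x for w, x in zip(weights, features)) + bias


def early_fusion_score(audio: Sequence[float], visual: Sequence[float],
                       joint_weights: Sequence[float], bias: float = 0.0) -> float:
    """Assumed low-level (feature-level) fusion: concatenate both feature
    vectors, then score them with a single classifier."""
    return linear_score(joint_weights, list(audio) + list(visual), bias)


def late_fusion_score(audio_score: float, visual_score: float, alpha: float = 0.5) -> float:
    """Assumed high-level (decision-level) fusion: convex combination of the
    scores produced independently by the audio and visual classifiers."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * audio_score + (1.0 - alpha) * visual_score


def detect(score: float, threshold: float = 0.0) -> bool:
    """Declare the object category present when the fused score exceeds the threshold."""
    return score > threshold


if __name__ == "__main__":
    # Toy feature vectors and weights; all values are made up for illustration.
    audio = [0.2, 0.8]
    visual = [0.5, 0.1, 0.4]
    a = linear_score([1.0, -0.5], audio)          # audio-only score
    v = linear_score([0.3, 0.2, 0.5], visual)     # visual-only score
    late = late_fusion_score(a, v, alpha=0.5)
    early = early_fusion_score(audio, visual, [1.0, -0.5, 0.3, 0.2, 0.5])
    print(f"late={late:.3f} early={early:.3f} detected={detect(late)}")
```

Note that with purely linear scorers, as in this toy, concatenating features is equivalent to summing the per-channel scores, because w·[a; v] = w_a·a + w_v·v. Feature-level fusion typically pays off only when the joint classifier can model cross-modal interactions (for example with a non-linear kernel), while decision-level fusion lets each channel keep its own specialized model.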
Pages: 539-548
Number of pages: 10
Related papers (50 in total)
  • [41] Tampering Detection of Audio-Visual Content using Encrypted Watermarks
    Rigoni, Ronaldo
    Freitas, Pedro Garcia
    Farias, Mylene C. Q.
    2014 27TH SIBGRAPI CONFERENCE ON GRAPHICS, PATTERNS AND IMAGES (SIBGRAPI), 2014: 196-203
  • [42] Audio-Visual Voice Activity Detection Using Diffusion Maps
    Dov, David
    Talmon, Ronen
    Cohen, Israel
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2015, 23(04): 732-745
  • [43] Speaker position detection system using audio-visual information
    Matsuo, N
    Kitagawa, H
    Nagata, S
    FUJITSU SCIENTIFIC & TECHNICAL JOURNAL, 1999, 35(02): 212-220
  • [44] Voice activity detection for driver using audio-visual integration
    Ninomiya, Yoshiki
    Ban, Yoshihide
    Maeno, Toshiki
    Negi, Daisuke
    Miyajima, Chiyomi
    Mori, Kensaku
    Kitasaka, Takayuki
    Suenaga, Yasuhito
    Kyokai Joho Imeji Zasshi/Journal of the Institute of Image Information and Television Engineers, 2008, 62(03): 435-441
  • [45] Active Speaker Detection Using Audio-Visual Sensor Array
    Kheradiya, Jatin
    Reddy, Sandeep C.
    Hegde, Rajesh
    2014 IEEE INTERNATIONAL SYMPOSIUM ON SIGNAL PROCESSING AND INFORMATION TECHNOLOGY (ISSPIT), 2014: 480-484
  • [46] Audio-visual deepfake detection using articulatory representation learning
    Wang, Yujia
    Huang, Hua
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 248
  • [47] Multimodal and Temporal Perception of Audio-visual Cues for Emotion Recognition
    Ghaleb, Esam
    Popa, Mirela
    Asteriadis, Stylianos
    2019 8TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION (ACII), 2019
  • [48] Associating Audio-Visual Activity Cues in a Dominance Estimation Framework
    Hung, Hayley
    Huang, Yan
    Yeo, Chuohao
    Gatica-Perez, Daniel
    2008 IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, VOLS 1-3, 2008: 1644+
  • [49] Audio-visual object search is changed by bilingual experience
    Chabal, Sarah
    Schroeder, Scott R.
    Marian, Viorica
    ATTENTION PERCEPTION & PSYCHOPHYSICS, 2015, 77(08): 2684-2693