Video Event Understanding using Natural Language Descriptions

Cited by: 20
Authors
Ramanathan, Vignesh [1]
Liang, Percy [2]
Li, Fei-Fei [2]
Affiliations
[1] Stanford Univ, Dept Elect Engn, Stanford, CA 94305 USA
[2] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA
DOI
10.1109/ICCV.2013.117
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Human action and role recognition play an important part in complex event understanding. State-of-the-art methods learn action and role models from detailed spatio-temporal annotations, which require extensive human effort. In this work, we propose a method to learn such models based on natural language descriptions of the training videos, which are easier to collect and scale with the number of actions and roles. There are two challenges in using this form of weak supervision. First, these descriptions only provide a high-level summary and often do not directly mention the actions and roles occurring in a video. Second, natural language descriptions do not provide spatio-temporal annotations of actions and roles. To tackle these challenges, we introduce a topic-based semantic relatedness (SR) measure between a video description and an action and role label, and incorporate it into a posterior regularization objective. Our event recognition system based on these action and role models matches the state-of-the-art method on the TRECVID-MED11 event kit, despite weaker supervision.
Pages: 905-912
Number of pages: 8