Visual Event Recognition in Videos by Learning from Web Data

Cited by: 40
Authors
Duan, Lixin [1 ]
Xu, Dong [1 ]
Tsang, Ivor Wai-Hung [1 ]
Luo, Jiebo [2 ]
Affiliations
[1] Nanyang Technol Univ, Sch Comp Engn, Singapore 639798, Singapore
[2] Univ Rochester, Dept Comp Sci, Rochester, NY 14627 USA
Keywords
Event recognition; transfer learning; domain adaptation; cross-domain learning; adaptive MKL; aligned space-time pyramid matching; KERNEL; CONTEXT; IMAGES; SVM;
DOI
10.1109/TPAMI.2011.265
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
We propose a visual event recognition framework for consumer videos that leverages a large amount of loosely labeled web videos (e.g., from YouTube). Observing that consumer videos generally exhibit large intraclass variation within the same type of event, we first propose a new method, called Aligned Space-Time Pyramid Matching (ASTPM), to measure the distance between any two video clips. Second, we propose a new transfer learning method, referred to as Adaptive Multiple Kernel Learning (A-MKL), in order to 1) fuse the information from multiple pyramid levels and features (i.e., space-time features and static SIFT features) and 2) cope with the considerable variation in feature distributions between videos from the two domains (i.e., the web video domain and the consumer video domain). For each pyramid level and each type of local feature, we first train a set of SVM classifiers on the combined training set from the two domains using multiple base kernels of different kernel types and parameters; these classifiers are then fused with equal weights to obtain a prelearned average classifier. In A-MKL, for each event class we learn an adapted target classifier based on multiple base kernels and the prelearned average classifiers from this event class or from all event classes by minimizing both the structural risk functional and the mismatch between the data distributions of the two domains. Extensive experiments demonstrate the effectiveness of the proposed framework, which requires only a small number of labeled consumer videos by leveraging web data. We also conduct an in-depth investigation of various aspects of A-MKL, such as the analysis of the combination coefficients of the prelearned classifiers, the convergence of the learning algorithm, and the performance variation with different proportions of labeled consumer videos. Moreover, we show that A-MKL using the prelearned classifiers from all event classes outperforms A-MKL using the prelearned classifiers from each individual event class alone.
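Two ingredients sketched in the abstract can be illustrated in a few lines: the equal-weight fusion of per-kernel classifiers into a "prelearned average classifier", and a measure of the distribution mismatch between the web and consumer domains (commonly quantified with the Maximum Mean Discrepancy in such transfer-learning objectives). The following NumPy sketch is illustrative only; the function names, the RBF-kernel choice, and the toy data are assumptions, not the paper's implementation:

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    # Pairwise RBF kernel matrix: k(x, y) = exp(-gamma * ||x - y||^2).
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def mmd2(X_src, X_tgt, gamma):
    # Squared Maximum Mean Discrepancy between source ("web") and
    # target ("consumer") samples under one base kernel -- a common
    # proxy for the cross-domain distribution mismatch penalized
    # alongside the structural risk in methods like A-MKL.
    k_ss = rbf_kernel(X_src, X_src, gamma).mean()
    k_tt = rbf_kernel(X_tgt, X_tgt, gamma).mean()
    k_st = rbf_kernel(X_src, X_tgt, gamma).mean()
    return k_ss + k_tt - 2.0 * k_st

def average_classifier(decision_fns, X):
    # Equal-weight fusion of prelearned per-kernel decision functions,
    # i.e., the "prelearned average classifier" of the abstract.
    return np.mean([f(X) for f in decision_fns], axis=0)

# Toy illustration: matching vs. shifted feature distributions.
rng = np.random.default_rng(0)
web = rng.normal(0.0, 1.0, size=(60, 4))       # hypothetical web-domain features
consumer = rng.normal(0.0, 1.0, size=(60, 4))  # same underlying distribution
print(mmd2(web, consumer, 0.5))        # small: distributions match
print(mmd2(web, consumer + 3.0, 0.5))  # much larger: distributions shifted
```

The mismatch term rewards kernels (and kernel weights) under which the two domains look alike, which is why minimizing it jointly with the structural risk adapts the classifier toward the consumer domain.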
Pages: 1667-1680
Page count: 14