Visual Event Recognition in Videos by Learning from Web Data

Cited by: 40
Authors
Duan, Lixin [1 ]
Xu, Dong [1 ]
Tsang, Ivor Wai-Hung [1 ]
Luo, Jiebo [2 ]
Affiliations
[1] Nanyang Technol Univ, Sch Comp Engn, Singapore 639798, Singapore
[2] Univ Rochester, Dept Comp Sci, Rochester, NY 14627 USA
Keywords
Event recognition; transfer learning; domain adaptation; cross-domain learning; adaptive MKL; aligned space-time pyramid matching; KERNEL; CONTEXT; IMAGES; SVM;
DOI
10.1109/TPAMI.2011.265
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
We propose a visual event recognition framework for consumer videos that leverages a large amount of loosely labeled web videos (e.g., from YouTube). Observing that consumer videos generally contain large intraclass variations within the same type of event, we first propose a new method, called Aligned Space-Time Pyramid Matching (ASTPM), to measure the distance between any two video clips. Second, we propose a new transfer learning method, referred to as Adaptive Multiple Kernel Learning (A-MKL), in order to 1) fuse the information from multiple pyramid levels and feature types (i.e., space-time features and static SIFT features) and 2) cope with the considerable variation in feature distributions between videos from the two domains (i.e., the web video domain and the consumer video domain). For each pyramid level and each type of local feature, we first train a set of SVM classifiers on the combined training set from both domains, using multiple base kernels of different kernel types and parameters; these classifiers are then fused with equal weights to obtain a prelearned average classifier. In A-MKL, for each event class we learn an adapted target classifier based on the multiple base kernels and the prelearned average classifiers from this event class or from all event classes, by minimizing both the structural risk functional and the mismatch between the data distributions of the two domains. Extensive experiments demonstrate the effectiveness of the proposed framework, which requires only a small number of labeled consumer videos by leveraging web data. We also conduct an in-depth investigation of various aspects of A-MKL, such as the analysis of the combination coefficients of the prelearned classifiers, the convergence of the learning algorithm, and the performance variation when using different proportions of labeled consumer videos. Moreover, we show that A-MKL using the prelearned classifiers from all event classes outperforms A-MKL using the prelearned classifiers from each individual event class only.
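As a rough illustration of the ASTPM distance described above, the sketch below computes the distance between two clips at a single pyramid level, assuming each clip has been divided into space-time subvolumes represented by local-feature histograms. The paper formulates the subvolume alignment as an integer-flow (earth mover's distance style) problem; one-to-one bipartite matching via scipy's linear_sum_assignment is used here purely as a simpler stand-in, and the function name and array shapes are hypothetical.

```python
# Hypothetical sketch, not the authors' code: ASTPM-style distance between
# two clips at one pyramid level. Each row of subvols_* is the feature
# histogram of one space-time subvolume. The paper solves an integer-flow
# (EMD-like) alignment; bipartite matching is a simplified stand-in.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def astpm_level_distance(subvols_a: np.ndarray, subvols_b: np.ndarray) -> float:
    cost = cdist(subvols_a, subvols_b)        # pairwise subvolume distances
    rows, cols = linear_sum_assignment(cost)  # best one-to-one alignment
    return float(cost[rows, cols].mean())     # distance under that alignment
```

The per-level distances obtained this way would then be turned into base kernels (e.g., Gaussian kernels over the distances) feeding the classifiers sketched next.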
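The prelearned average classifier can likewise be illustrated with a hedged scikit-learn sketch: one SVM per base kernel, trained on the combined web + consumer training set, with equal-weight fusion of the decision values. The RBF bandwidths, function names, and data shapes are illustrative assumptions, not the paper's settings.

```python
# Hypothetical sketch of a prelearned average classifier: one SVM per base
# kernel (here RBF kernels with several bandwidths), trained on the combined
# web + consumer training set, with equal-weight fusion of decision values.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def train_average_classifier(X_train, y_train, gammas=(0.1, 1.0, 10.0)):
    svms = [SVC(kernel="precomputed").fit(rbf_kernel(X_train, X_train, gamma=g),
                                          y_train)
            for g in gammas]

    def average_decision(X_test):
        # Equal-weight fusion of the per-kernel SVM decision values.
        scores = [svm.decision_function(rbf_kernel(X_test, X_train, gamma=g))
                  for svm, g in zip(svms, gammas)]
        return np.mean(scores, axis=0)

    return average_decision
```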
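The A-MKL objective described in the abstract can also be written down in a hedged form. The reconstruction below follows the shape the abstract implies (an adapted classifier built from prelearned average classifiers plus a multiple-kernel perturbation, trained to minimize structural risk plus a domain-mismatch term); the exact regularizers, weights, and the definition of the mismatch term should be checked against the paper.

```latex
% Hedged reconstruction of the A-MKL target classifier and objective;
% exact regularization and the mismatch term should be checked against
% the paper.
\[
  f^{T}(\mathbf{x}) \;=\; \sum_{p=1}^{P} \beta_{p}\,\bar f_{p}(\mathbf{x})
  \;+\; \mathbf{w}^{\top}\psi(\mathbf{x}) + b,
  \qquad
  \psi(\mathbf{x}) \;=\; \big[\sqrt{d_{1}}\,\phi_{1}(\mathbf{x})^{\top},
  \dots, \sqrt{d_{M}}\,\phi_{M}(\mathbf{x})^{\top}\big]^{\top},
\]
\[
  \min_{\mathbf{d}\ge 0}\;
  \tfrac{1}{2}\,\Omega^{2}(\mathbf{d})
  \;+\; \theta \min_{\mathbf{w},\,\boldsymbol{\beta},\,b}
  \Big( \tfrac{1}{2}\big(\|\mathbf{w}\|^{2}
        + \lambda\,\|\boldsymbol{\beta}\|^{2}\big)
        + C \sum_{i} \ell\big(y_{i},\, f^{T}(\mathbf{x}_{i})\big) \Big),
\]
where $\bar f_{p}$ are the prelearned average classifiers, $d_{m}$ are the
base-kernel weights, and $\Omega(\mathbf{d})$ measures the mismatch between
the web and consumer domains (e.g., the Maximum Mean Discrepancy) under the
combined kernel $k = \sum_{m} d_{m} k_{m}$.
```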
Pages: 1667 - 1680
Page count: 14
Related papers
50 records in total
  • [21] Automatic Data Augmentation from Massive Web Images for Deep Visual Recognition
    Bai, Yalong
    Yang, Kuiyuan
    Mei, Tao
    Ma, Wei-Ying
    Zhao, Tiejun
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2018, 14 (03)
  • [22] Visual quality assessment for web videos
    Xia, Tian
    Mei, Tao
    Hua, Gang
    Zhang, Yong-Dong
    Hua, Xian-Sheng
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2010, 21 (08) : 826 - 837
  • [23] Exploiting Web Images for Event Recognition in Consumer Videos: A Multiple Source Domain Adaptation Approach
    Duan, Lixin
    Xu, Dong
    Chang, Shih-Fu
    2012 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2012, : 1338 - 1345
  • [24] TRANSFER LEARNING FOR VIDEOS: FROM ACTION RECOGNITION TO SIGN LANGUAGE RECOGNITION
    Sarhan, Noha
    Frintrop, Simone
    2020 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2020, : 1811 - 1815
  • [25] Learning Visual Affordance Grounding From Demonstration Videos
    Luo, Hongchen
    Zhai, Wei
    Zhang, Jing
    Cao, Yang
    Tao, Dacheng
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (11) : 16857 - 16871
  • [27] Bimodal Learning Engagement Recognition from Videos in the Classroom
    Hu, Meijia
    Wei, Yantao
    Li, Mengsiying
    Yao, Huang
    Deng, Wei
    Tong, Mingwen
    Liu, Qingtang
    SENSORS, 2022, 22 (16)
  • [28] Event Model Learning from Complex Videos using ILP
    Dubba, Krishna S. R.
    Cohn, Anthony G.
    Hogg, David C.
    ECAI 2010 - 19TH EUROPEAN CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2010, 215 : 93 - 98
  • [29] Super Fast Event Recognition in Internet Videos
    Jiang, Yu-Gang
    Dai, Qi
    Mei, Tao
    Rui, Yong
    Chang, Shih-Fu
    IEEE TRANSACTIONS ON MULTIMEDIA, 2015, 17 (08) : 1174 - 1186
  • [30] BiSPL: Bidirectional Self-Paced Learning for Recognition From Web Data
    Wu, Xiaoping
    Chang, Jianlong
    Lai, Yu-Kun
    Yang, Jufeng
    Tian, Qi
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2021, 30 : 6512 - 6527