Leveraging Transformers for Weakly Supervised Object Localization in Unconstrained Videos

Authors
Murtaza, Shakeeb [1]
Pedersoli, Marco [1]
Sarraf, Aydin [2]
Granger, Eric [1]
Affiliations
[1] ETS Montreal, Dept Syst Engn, LIVIA, Montreal, PQ, Canada
[2] Ericsson, Global AI Accelerator, Montreal, PQ, Canada
DOI
10.1007/978-3-031-71602-7_17
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Weakly-Supervised Video Object Localization (WSVOL) involves localizing an object in videos using only video-level labels, also referred to as tags. State-of-the-art WSVOL methods like Temporal CAM (TCAM) rely on class activation mapping (CAM) and typically require a pre-trained CNN classifier. However, their localization accuracy is affected by their tendency to minimize the mutual information between different instances of a class and exploit temporal information during training for downstream tasks, e.g., detection and tracking. In the absence of bounding box annotations, it is challenging to extract precise information about objects from temporal cues because the model struggles to locate objects over time. To address these issues, a novel method called transformer-based CAM for videos (TrCAM-V) is proposed for WSVOL. It consists of a DeiT backbone with two heads, one for classification and one for localization. The classification head is trained using a standard classification loss (CL), while the localization head is trained using pseudo-labels that are extracted using a pre-trained CLIP model. In these pseudo-labels, the high and low activation values are considered to be foreground and background regions, respectively. The TrCAM-V method allows training a localization network by sampling pseudo-pixels on the fly from these regions. Additionally, a conditional random field (CRF) loss is employed to align the object boundaries with the foreground map. During inference, the model can process individual frames for real-time localization applications. Extensive experiments on the challenging YouTube-Objects unconstrained video datasets show that the TrCAM-V method achieves new state-of-the-art performance in terms of classification and localization accuracy. Code: https://github.com/shakeebmurtaza/TrCAM/.
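Illustrative sketch (not taken from the paper or its repository): the abstract says pseudo-pixels are sampled on the fly from high-activation (foreground) and low-activation (background) regions of a pseudo-label map. The snippet below shows one plausible reading of that step in plain Python. The function name sample_pseudo_pixels, the parameters n_fg, n_bg and region_frac, the top/bottom-fraction rule for defining regions, and the uniform random sampling are all assumptions made for illustration; the authors' actual selection rule may differ.

```python
import random


def sample_pseudo_pixels(cam, n_fg=2, n_bg=2, region_frac=0.2, rng=None):
    """Build a sparse label map from a 2D activation map.

    Returns a grid of the same shape as `cam` where
    1 = sampled foreground pixel, 0 = sampled background pixel,
    -1 = unlabeled (ignored by a localization loss).
    Hypothetical helper: region definition and sampling rule are assumptions.
    """
    rng = rng or random.Random()
    h, w = len(cam), len(cam[0])
    # Rank all pixel coordinates by activation, highest first.
    coords = sorted(
        ((r, c) for r in range(h) for c in range(w)),
        key=lambda rc: cam[rc[0]][rc[1]],
        reverse=True,
    )
    n = len(coords)
    # Region size: a fraction of the map, capped at half so that the
    # foreground and background regions do not overlap (needs >= 2 pixels).
    k = max(1, min(int(region_frac * n), n // 2))
    fg_region, bg_region = coords[:k], coords[-k:]

    labels = [[-1] * w for _ in range(h)]
    # Draw a fresh random subset each call ("on the fly" sampling).
    for r, c in rng.sample(fg_region, min(n_fg, k)):
        labels[r][c] = 1
    for r, c in rng.sample(bg_region, min(n_bg, k)):
        labels[r][c] = 0
    return labels


# Toy 4x4 activation map with a bright 2x2 centre (made-up values).
cam = [
    [0.0, 0.1, 0.1, 0.0],
    [0.1, 0.9, 0.8, 0.1],
    [0.1, 0.8, 0.9, 0.1],
    [0.0, 0.1, 0.1, 0.0],
]
labels = sample_pseudo_pixels(cam, n_fg=2, n_bg=2, region_frac=0.25,
                              rng=random.Random(0))
```

In this toy case the foreground region is the four centre pixels and the background region is the four zero-valued corners; each call labels two pixels from each region and leaves the rest unlabeled, so a loss computed only on labeled pixels sees different supervision at every training step.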
Pages: 195-207
Page count: 13