Leveraging Transformers for Weakly Supervised Object Localization in Unconstrained Videos

Cited by: 0

Authors:

Murtaza, Shakeeb [1]

Pedersoli, Marco [1]

Sarraf, Aydin [2]

Granger, Eric [1]

Affiliations:

[1] ETS Montreal, Dept Syst Engn, LIVIA, Montreal, PQ, Canada

[2] Ericsson, Global AI Accelerator, Montreal, PQ, Canada

Source:

ARTIFICIAL NEURAL NETWORKS IN PATTERN RECOGNITION, ANNPR 2024 | 2024 / Vol. 15154

Keywords:

DOI:

10.1007/978-3-031-71602-7_17

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Weakly-Supervised Video Object Localization (WSVOL) involves localizing an object in videos using only video-level labels, also referred to as tags. State-of-the-art WSVOL methods like Temporal CAM (TCAM) rely on class activation mapping (CAM) and typically require a pre-trained CNN classifier. However, their localization accuracy is affected by their tendency to minimize the mutual information between different instances of a class and exploit temporal information during training for downstream tasks, e.g., detection and tracking. In the absence of bounding box annotation, it is challenging to exploit precise information about objects from temporal cues because the model struggles to locate objects over time. To address these issues, a novel method called transformer-based CAM for videos (TrCAM-V) is proposed for WSVOL. It consists of a DeiT backbone with two heads for classification and localization. The classification head is trained using a standard classification loss (CL), while the localization head is trained using pseudo-labels that are extracted using a pre-trained CLIP model. From these pseudo-labels, the high and low activation values are considered to be foreground and background regions, respectively. Our TrCAM-V method allows training a localization network by sampling pseudo-pixels on the fly from these regions. Additionally, a conditional random field (CRF) loss is employed to align the object boundaries with the foreground map. During inference, the model can process individual frames for real-time localization applications. Extensive experiments on the challenging YouTube-Objects unconstrained video datasets show that our TrCAM-V method achieves new state-of-the-art performance in terms of classification and localization accuracy. Code: https://github.com/shakeebmurtaza/TrCAM/.


Pages: 195-207

Number of pages: 13

