Delving into CLIP latent space for Video Anomaly Recognition

Cited by: 1

Authors:

Zanella, Luca [1]

Liberatori, Benedetta [1]

Menapace, Willi [1]

Poiesi, Fabio [2]

Wang, Yiming [2]

Ricci, Elisa [1,2]

Affiliations:

[1] Univ Trento, Trento, Italy

[2] Fdn Bruno Kessler, Trento, Italy

Source:

COMPUTER VISION AND IMAGE UNDERSTANDING | 2024 / Vol. 249

Keywords:

Video anomaly detection and recognition; Multi-modal learning;

DOI:

10.1016/j.cviu.2024.104163

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We tackle the complex problem of detecting and recognising anomalies in surveillance videos at the frame level, utilising only video-level supervision. We introduce the novel method AnomalyCLIP, the first to combine Vision and Language Models (VLMs), such as CLIP, with multiple instance learning for joint video anomaly detection and classification. Our approach specifically involves manipulating the latent CLIP feature space to identify the normal event subspace, which in turn allows us to effectively learn text-driven directions for abnormal events. When anomalous frames are projected onto these directions, they exhibit a large feature magnitude if they belong to a particular class. We also leverage a computationally efficient Transformer architecture to model short- and long-term temporal dependencies between frames, ultimately producing the final anomaly score and class prediction probabilities. We compare AnomalyCLIP against state-of-the-art methods on three major anomaly detection benchmarks, i.e. ShanghaiTech, UCF-Crime, and XD-Violence, and empirically show that it outperforms baselines in recognising video anomalies. Project website and code are available at https://lucazanella.github.io/AnomalyCLIP/.
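
The projection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the use of a single prototype vector as a stand-in for the normal event subspace, the fixed (not learned) class embeddings, and the toy 3-D data are all assumptions made for illustration. It only shows how re-centring features on a normal reference and projecting them onto per-class directions yields a large magnitude for frames displaced towards a given class.

```python
import numpy as np


def project_onto_class_directions(frame_feats, normal_prototype, class_text_embs):
    """Re-centre features on a normal prototype and project onto class directions.

    frame_feats:      (T, D) per-frame features (stand-ins for CLIP image features)
    normal_prototype: (D,)   assumed centre of the normal-event subspace
    class_text_embs:  (C, D) stand-ins for text embeddings of anomaly classes
    Returns a (T, C) array of signed projection magnitudes.
    """
    # Directions point from the normal prototype towards each class embedding.
    directions = class_text_embs - normal_prototype
    directions = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    # Express every frame relative to the normal prototype, then project.
    centred = frame_feats - normal_prototype
    return centred @ directions.T


# Toy data (hypothetical, not real CLIP features): 3-D space, two anomaly classes.
normal_prototype = np.array([1.0, 1.0, 1.0])
class_text_embs = np.array([[3.0, 1.0, 1.0],   # class 0 lies along +x from normal
                            [1.0, 3.0, 1.0]])  # class 1 lies along +y from normal
frame_feats = np.array([[1.1, 0.9, 1.0],   # close to the normal prototype
                        [2.5, 1.0, 1.0],   # displaced towards class 0
                        [1.0, 2.8, 1.0]])  # displaced towards class 1

projections = project_onto_class_directions(frame_feats, normal_prototype, class_text_embs)
predicted_class = projections.argmax(axis=1)
```

In this toy setting the near-normal frame has small projections on both directions, while each displaced frame has a large projection only on the direction of its own class. In the method described above, the directions are learned and a Transformer models temporal dependencies before the final scores are produced; neither is modelled here.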


Pages: 13
