Delving into CLIP latent space for Video Anomaly Recognition

Cited by: 1
Authors
Zanella, Luca [1]
Liberatori, Benedetta [1]
Menapace, Willi [1]
Poiesi, Fabio [2]
Wang, Yiming [2]
Ricci, Elisa [1, 2]
Affiliations
[1] Univ Trento, Trento, Italy
[2] Fdn Bruno Kessler, Trento, Italy
Keywords
Video anomaly detection and recognition; Multi-modal learning
DOI
10.1016/j.cviu.2024.104163
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We tackle the complex problem of detecting and recognising anomalies in surveillance videos at the frame level, utilising only video-level supervision. We introduce the novel method AnomalyCLIP, the first to combine Vision and Language Models (VLMs), such as CLIP, with multiple instance learning for joint video anomaly detection and classification. Our approach specifically involves manipulating the latent CLIP feature space to identify the normal event subspace, which in turn allows us to effectively learn text-driven directions for abnormal events. When anomalous frames are projected onto these directions, they exhibit a large feature magnitude if they belong to a particular class. We also leverage a computationally efficient Transformer architecture to model short- and long-term temporal dependencies between frames, ultimately producing the final anomaly score and class prediction probabilities. We compare AnomalyCLIP against state-of-the-art methods on three major anomaly detection benchmarks, i.e. ShanghaiTech, UCF-Crime, and XD-Violence, and empirically show that it outperforms baselines in recognising video anomalies. Project website and code are available at https://lucazanella.github.io/AnomalyCLIP/.
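The projection idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (their code is at the project website above). It assumes, for illustration only, that the normal event subspace is summarised by the mean of normal-frame features, that each class direction is a text embedding taken relative to that mean, and that the video-level score used for multiple instance learning is a top-k mean of frame scores. The function names `project_onto_directions` and `video_score_topk`, and the toy 3-dimensional features, are hypothetical.

```python
import numpy as np


def project_onto_directions(frame_feats, normal_feats, class_text_feats):
    """Re-centre features on a 'normal' prototype and project them onto
    unit-length class directions. Returns (num_frames, num_classes)."""
    # Assumption: the normal subspace is summarised by a single mean vector.
    normal_proto = normal_feats.mean(axis=0)
    # Assumption: a class direction is its text embedding relative to normal.
    directions = class_text_feats - normal_proto
    directions = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    centred = frame_feats - normal_proto
    return centred @ directions.T


def video_score_topk(frame_scores, k):
    """Video-level score as the mean of the k largest frame scores
    (one common multiple-instance aggregation, assumed here)."""
    return float(np.sort(frame_scores)[-k:].mean())


# Toy stand-ins for CLIP features (real ones are high-dimensional).
normal_feats = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
class_text_feats = np.array([[1.0, 2.0, 0.0], [1.0, 0.0, 3.0]])
frame_feats = np.array([[1.0, 0.0, 0.0],   # looks normal
                        [1.0, 5.0, 0.0],   # aligned with class 0
                        [1.0, 0.0, 4.0]])  # aligned with class 1

proj = project_onto_directions(frame_feats, normal_feats, class_text_feats)
frame_scores = proj.max(axis=1)       # magnitude along the best direction
frame_classes = proj.argmax(axis=1)   # only meaningful for high-scoring frames
video_score = video_score_topk(frame_scores, k=2)
```

In this toy setup the normal-looking frame projects to zero on both directions, while each anomalous frame has a large magnitude only along its own class direction, which mirrors the behaviour the abstract describes.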
Pages: 13
Related papers
50 in total
  • [21] CLIP-TSA: CLIP-ASSISTED TEMPORAL SELF-ATTENTION FOR WEAKLY-SUPERVISED VIDEO ANOMALY DETECTION
    Joo, Hyekang Kevin
    Khoa Vo
    Yamazaki, Kashu
    Ngan Le
    2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023, : 3230 - 3234
  • [22] CLIP2GAN: Toward Bridging Text With the Latent Space of GANs
    Wang, Yixuan
    Zhou, Wengang
    Bao, Jianmin
    Wang, Weilun
    Li, Li
    Li, Houqiang
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (08) : 6847 - 6859
  • [23] Long Movie Clip Classification with State-Space Video Models
    Islam, Md Mohaiminul
    Bertasius, Gedas
    COMPUTER VISION - ECCV 2022, PT XXXV, 2022, 13695 : 87 - 104
  • [24] Deep Learning in Latent Space for Video Prediction and Compression
    Liu, Bowen
    Chen, Yu
    Liu, Shiyu
    Kim, Hun-Seok
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 701 - 710
  • [25] DASC: Learning discriminative latent space for video clustering
    Lin, Jiaxin
    Gao, Xizhan
    Zhang, Zhihan
    Deng, Haotian
    NEUROCOMPUTING, 2025, 637
  • [26] Delving Deeper into the Decoder for Video Captioning
    Chen, Haoran
    Li, Jianmin
    Hu, Xiaolin
    ECAI 2020: 24TH EUROPEAN CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2020, 325 : 1079 - 1086
  • [27] Video Probabilistic Diffusion Models in Projected Latent Space
    Yu, Sihyun
    Sohn, Kihyuk
    Kim, Subin
    Shin, Jinwoo
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 18456 - 18466
  • [28] Delving Deeper Into Color Space
    Jraissati, Yasmina
    Douven, Igor
    I-PERCEPTION, 2018, 9 (04):
  • [29] Query by video clip
    Zhuang, YT
    Liu, XM
    Pan, YH
    PROCEEDINGS OF THE 6TH INTERNATIONAL CONFERENCE ON COMPUTER AIDED DESIGN & COMPUTER GRAPHICS, 1999, : 1288 - 1293
  • [30] Query by video clip
    Anil K. Jain
    Aditya Vailaya
    Xiong Wei
    Multimedia Systems, 1999, 7 : 369 - 384