Embedding-based pair generation for contrastive representation learning in audio-visual surveillance data

Times Cited: 0
Authors
Wang, Wei-Cheng [1]
De Coninck, Sander [1]
Leroux, Sam [1]
Simoens, Pieter [1]
Affiliations
[1] Univ Ghent, IDLab, imec, Ghent, Belgium
Source
Frontiers in Robotics and AI, 2024
Keywords
self-supervised learning; surveillance; audio-visual representation learning; contrastive learning; audio-visual event localization; anomaly detection; event search
DOI
10.3389/frobt.2024.1490718
CLC Number
TP24 [Robotics]
Discipline Code
080202; 1405
Abstract
Smart cities deploy various sensors, such as microphones and RGB cameras, to collect data that improve the safety and comfort of their citizens. Because data annotation is expensive, self-supervised methods such as contrastive learning are used to learn audio-visual representations for downstream tasks. Focusing on surveillance data, we investigate two common limitations of audio-visual contrastive learning: false negatives and the minimal sufficient information bottleneck. Irregular yet frequently recurring events can produce a considerable number of false-negative pairs and disrupt the model's training. To tackle this challenge, we propose a novel method that generates contrastive pairs based on the distance between embeddings of different modalities, rather than relying solely on temporal cues. The resulting semantically synchronized pairs, combined with a new loss function for multiple positives, can then be used to ease the minimal sufficient information bottleneck. We experimentally validate our approach on real-world data and show how the learnt representations can be used for different downstream tasks, including audio-visual event localization, anomaly detection, and event search. Our approach reaches performance comparable to that of state-of-the-art modality- and task-specific approaches.
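
To make the idea in the abstract concrete, the Python sketch below illustrates the two ingredients it describes: positive pairs mined from cross-modal embedding distances rather than temporal co-occurrence alone, and a contrastive loss that averages over all positives of each anchor. This is a minimal sketch, not the authors' implementation; the function names (mine_positive_mask, multi_positive_nce), the cosine-similarity threshold, and the temperature are hypothetical placeholders.

    # Minimal sketch (PyTorch) of embedding-based positive mining plus a
    # multi-positive contrastive loss. All names and hyperparameters here
    # are assumptions, not details taken from the paper.
    import torch
    import torch.nn.functional as F

    def mine_positive_mask(audio_emb, video_emb, sim_threshold=0.8):
        """Return an N x N boolean mask of positive audio-video pairs.

        Besides the temporally aligned diagonal, any cross-modal pair whose
        cosine similarity exceeds `sim_threshold` is also marked positive,
        so recurring events are not pushed apart as false negatives.
        """
        a = F.normalize(audio_emb, dim=1)
        v = F.normalize(video_emb, dim=1)
        sim = a @ v.t()                        # cosine similarity matrix
        pos = sim > sim_threshold              # semantically close pairs
        pos |= torch.eye(len(a), dtype=torch.bool, device=sim.device)
        return pos

    def multi_positive_nce(audio_emb, video_emb, pos_mask, temperature=0.07):
        """InfoNCE-style loss averaged over all positives of each anchor,
        in the spirit of supervised contrastive learning."""
        a = F.normalize(audio_emb, dim=1)
        v = F.normalize(video_emb, dim=1)
        logits = (a @ v.t()) / temperature
        log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
        # Mean log-likelihood of the positive set per audio anchor.
        per_anchor = (log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
        return -per_anchor.mean()

    # Toy usage: 8 temporally aligned clips with 128-d embeddings.
    audio = torch.randn(8, 128)
    video = torch.randn(8, 128)
    mask = mine_positive_mask(audio, video)
    loss = multi_positive_nce(audio, video, mask)

In practice one would likely compute the mask with a frozen or momentum-updated encoder and symmetrize the loss over both modality directions (audio-to-video and video-to-audio); both choices are assumptions here rather than details from the paper.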
Pages: 14