Embedding-based pair generation for contrastive representation learning in audio-visual surveillance data

Cited by: 0
Authors
Wang, Wei-Cheng [1]
De Coninck, Sander [1]
Leroux, Sam [1]
Simoens, Pieter [1]
Affiliations
[1] Univ Ghent, IDLab, imec, Ghent, Belgium
Source
Keywords
self-supervised learning; surveillance; audio-visual representation learning; contrastive learning; audio-visual event localization; anomaly detection; event search
DOI
10.3389/frobt.2024.1490718
Chinese Library Classification (CLC) number
TP24 [Robotics]
Discipline classification code
080202; 1405
Abstract
Smart cities deploy sensors such as microphones and RGB cameras to collect data that can improve the safety and comfort of citizens. Because data annotation is expensive, self-supervised methods such as contrastive learning are used to learn audio-visual representations for downstream tasks. Focusing on surveillance data, we investigate two common limitations of audio-visual contrastive learning: false negatives and the minimal sufficient information bottleneck. Irregular yet frequently recurring events can produce a considerable number of false-negative pairs and disrupt the model's training. To tackle this challenge, we propose a novel method for generating contrastive pairs based on the distance between embeddings of different modalities, rather than relying solely on temporal cues. The resulting semantically synchronized pairs can then be used, together with a new loss function for multiple positives, to ease the minimal sufficient information bottleneck. We experimentally validate our approach on real-world data and show how the learnt representations can be used for different downstream tasks, including audio-visual event localization, anomaly detection, and event search. Our approach reaches performance similar to that of state-of-the-art modality- and task-specific approaches.
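The pair-generation idea in the abstract can be illustrated with a minimal sketch. This record does not give the paper's actual procedure or loss, so everything below is an assumption made for illustration: the function names `positive_mask` and `multi_positive_loss`, the cosine-similarity threshold of 0.8, the temperature of 0.1, and the use of a SupCon-style multi-positive InfoNCE loss as a stand-in for the "new loss function for multiple positives". It shows only the general technique: an audio clip and a video clip count as a positive pair when their embeddings are close, even if they were not recorded at the same time.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def positive_mask(audio_emb, video_emb, threshold=0.8):
    """Hypothetical pair generator (not the paper's exact rule).

    Entry (i, j) is True when audio clip i and video clip j are treated as a
    positive pair: either they are time-aligned (i == j) or their cross-modal
    cosine similarity reaches the assumed threshold.
    """
    sim = l2_normalize(audio_emb) @ l2_normalize(video_emb).T
    mask = sim >= threshold
    np.fill_diagonal(mask, True)  # time-aligned pairs always stay positive
    return mask


def multi_positive_loss(audio_emb, video_emb, mask, temperature=0.1):
    """SupCon-style InfoNCE with several positives per anchor (a stand-in)."""
    logits = (l2_normalize(audio_emb) @ l2_normalize(video_emb).T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Average the negative log-likelihood over each anchor's positives.
    per_anchor = -(log_prob * mask).sum(axis=1) / mask.sum(axis=1)
    return float(per_anchor.mean())


# Toy data: clip 2 shows the same recurring event as clip 0 at another time.
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 8))
video = audio + 0.01 * rng.normal(size=(4, 8))
audio[2] = audio[0]
video[2] = video[0]

mask = positive_mask(audio, video)
loss = multi_positive_loss(audio, video, mask)
```

With purely temporal pairing, clips 0 and 2 would be pushed apart as negatives although they show the same event; here the mask marks them as positives of each other, which is the false-negative case the abstract describes.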
Pages: 14