Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation

Cited by: 11
Authors
Li, Liulei [1, 4]
Wang, Wenguan [1 ]
Zhou, Tianfei [2 ]
Li, Jianwu [3 ]
Yang, Yi [1 ]
Affiliations
[1] Zhejiang Univ, CCAI, ReLER, Hangzhou, Peoples R China
[2] Swiss Fed Inst Technol, Zurich, Switzerland
[3] Beijing Inst Technol, Beijing, Peoples R China
[4] Baidu VIS, Sunnyvale, CA USA
DOI
10.1109/CVPR52729.2023.01794
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The objective of this paper is self-supervised learning of video object segmentation (VOS). We develop a unified framework which simultaneously models cross-frame dense correspondence for locally discriminative feature learning and embeds object-level context for target-mask decoding. As a result, it is able to directly learn to perform mask-guided sequential segmentation from unlabeled videos, in contrast to previous efforts, which usually rely on an oblique solution: cheaply "copying" labels according to pixel-wise correlations. Concretely, our algorithm alternates between i) clustering video pixels to create pseudo segmentation labels ex nihilo; and ii) utilizing the pseudo labels to learn mask encoding and decoding for VOS. Unsupervised correspondence learning is further incorporated into this self-taught mask embedding scheme, so as to ensure the generic nature of the learnt representation and avoid cluster degeneracy. Our algorithm sets a new state of the art on two standard benchmarks (i.e., DAVIS17 and YouTube-VOS), narrowing the gap between self- and fully-supervised VOS in terms of both performance and network architecture design.
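The two ingredients named in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the function names (`kmeans_pseudo_labels`, `propagate_labels`), the plain k-means clustering, the farthest-point initialisation, and the temperature value are all assumptions made for illustration, and per-pixel features are given as hand-written 2-D vectors rather than network outputs. The first function shows how clustering pixel features can yield pseudo labels without annotation; the second shows the correlation-based label "copying" that the abstract contrasts with learned mask decoding.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pseudo_labels(feats, k, iters=10):
    """Cluster per-pixel feature vectors; return one cluster id per pixel."""
    # Deterministic farthest-point initialisation (an assumption of this sketch).
    centers = [list(feats[0])]
    while len(centers) < k:
        far = max(feats, key=lambda f: min(sq_dist(f, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(feats)
    for _ in range(iters):
        # Assignment step: each pixel takes the id of its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(f, centers[j])) for f in feats]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [f for f, l in zip(feats, labels) if l == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def propagate_labels(ref_feats, ref_labels, qry_feats, num_classes, temperature=0.1):
    """Correlation-based label copying: softmax-weighted vote over reference pixels."""
    out = []
    for q in qry_feats:
        # Dot-product affinity between the query pixel and every reference pixel.
        sims = [sum(a * b for a, b in zip(q, r)) / temperature for r in ref_feats]
        m = max(sims)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in sims]
        votes = [0.0] * num_classes
        for w, l in zip(weights, ref_labels):
            votes[l] += w
        out.append(max(range(num_classes), key=lambda c: votes[c]))
    return out

# Toy data: four reference "pixels" forming two groups, two query "pixels".
ref_feats = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
qry_feats = [(0.95, 0.05), (0.05, 0.95)]

pseudo = kmeans_pseudo_labels(ref_feats, k=2)  # step i): pseudo labels from clustering
copied = propagate_labels(ref_feats, pseudo, qry_feats, num_classes=2)
print(pseudo, copied)
```

In the framework described by the abstract, the pseudo labels would instead supervise a learned mask encoder and decoder; the copying function is included only to make the contrast concrete.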
Pages: 18706-18716
Page count: 11