A closer look at referring expressions for video object segmentation

Cited by: 6
Authors
Bellver, Miriam [1]
Ventura, Carles [2]
Silberer, Carina [3]
Kazakos, Ioannis [4]
Torres, Jordi [1]
Giro-i-Nieto, Xavier [5, 6]
Affiliations
[1] Barcelona Supercomp Ctr BSC, Barcelona, Spain
[2] Univ Oberta Catalunya UOC, Barcelona, Spain
[3] Univ Stuttgart, Inst NLP, Stuttgart, Germany
[4] Natl Tech Univ Athens, Athens, Greece
[5] Univ Politecn Catalunya UPC, Barcelona, Catalonia, Spain
[6] CSIC UPC, Inst Robot & Informat Ind, Barcelona, Catalonia, Spain
Keywords
Referring expressions; Video object segmentation; Vision and language
DOI
10.1007/s11042-022-13413-x
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The task of Language-guided Video Object Segmentation (LVOS) aims at generating binary masks for an object referred to by a linguistic expression. When this expression unambiguously describes an object in the scene, it is called a referring expression (RE). Our work argues that existing benchmarks used for LVOS are mainly composed of trivial cases, in which referents can be identified with simple phrases. Our analysis relies on a new categorization of the referring expressions in the DAVIS-2017 and Actor-Action datasets into trivial and non-trivial REs, where the non-trivial REs are further annotated with seven RE semantic categories. We leverage these data to analyze the performance of RefVOS, a novel neural network that obtains competitive results for the task of language-guided image segmentation and state-of-the-art results for LVOS. Our study indicates that the major challenges for the task are related to understanding motion and static actions.
Pages: 4419-4438
Page count: 20
Related papers
50 items in total
  • [21] Weakly Supervised Referring Video Object Segmentation With Object-Centric Pseudo-Guidance
    Wang, Weikang
    Su, Yuting
    Liu, Jing
    Sun, Wei
    Zhai, Guangtao
    IEEE TRANSACTIONS ON MULTIMEDIA, 2025, 27 : 1320 - 1333
  • [22] LEVERAGING VISUAL PROMPTS TO GUIDE LANGUAGE MODELING FOR REFERRING VIDEO OBJECT SEGMENTATION
    Gao, Qiqi
    Zhong, Wanjun
    Li, Jie
    Zhao, Tiejun
    2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023, : 685 - 689
  • [23] Spectrum-guided Multi-granularity Referring Video Object Segmentation
    Miao, Bo
    Bennamoun, Mohammed
    Gao, Yongsheng
    Mian, Ajmal
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 920 - 930
  • [24] CLUE: Contrastive language-guided learning for referring video object segmentation
    Gao, Qiqi
    Zhong, Wanjun
    Li, Jie
    Zhao, Tiejun
    PATTERN RECOGNITION LETTERS, 2024, 178 : 115 - 121
  • [25] Closer look at zoom video
    IC Card Syst Des, 5 (10):
  • [26] Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation
    Zhu, Zixin
    Feng, Xuelu
    Chen, Dongdong
    Yuan, Junsong
    Qiao, Chunming
    Hua, Gang
    COMPUTER VISION - ECCV 2024, PT XII, 2025, 15070 : 452 - 469
  • [27] Mamba-driven hierarchical temporal multimodal alignment for referring video object segmentation
    Liang, Le
    Zhang, Lefei
    NEUROCOMPUTING, 2025, 622
  • [28] Multi-Level Representation Learning with Semantic Alignment for Referring Video Object Segmentation
    Wu, Dongming
    Dong, Xingping
    Shao, Ling
    Shen, Jianbing
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 4986 - 4995
  • [29] Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation
    Ding, Zihan
    Hui, Tianrui
    Huang, Junshi
    Wei, Xiaoming
    Han, Jizhong
    Liu, Si
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 4954 - 4963
  • [30] SLVP: Self-supervised Language-Video Pre-training for Referring Video Object Segmentation
    Mei, Jie
    Piergiovanni, A. J.
    Hwang, Jenq-Neng
    Li, Wei
    2024 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION WORKSHOPS, WACVW 2024, 2024, : 507 - 517