Fully Transformer-Equipped Architecture for end-to-end Referring Video Object Segmentation

Cited by: 6
Authors
Li, Ping [1 ,2 ]
Zhang, Yu [1 ]
Yuan, Li [3 ]
Xu, Xianghua [1 ]
Affiliations
[1] Hangzhou Dianzi Univ, Sch Comp Sci & Technol, Hangzhou, Peoples R China
[2] Guangdong Lab Artificial Intelligence & Digital Ec, Shenzhen, Peoples R China
[3] Peking Univ, Sch Elect & Comp Engn, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video object segmentation; Stacked transformer; Diverse object mask; Vision-language alignment;
DOI
10.1016/j.ipm.2023.103566
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Referring Video Object Segmentation (RVOS) requires segmenting the object in a video referred to by a natural language query. Existing methods mainly rely on sophisticated pipelines to tackle this cross-modal task and do not explicitly model the object-level spatial context, which plays an important role in locating the referred object. Therefore, we propose an end-to-end RVOS framework built entirely upon transformers, termed Fully Transformer-Equipped Architecture (FTEA), which treats the RVOS task as a mask sequence learning problem and regards all objects in the video as candidate objects. Given a video clip with a text query, the encoder yields visual and textual features, and the corresponding pixel-level and word-level features are aligned in terms of semantic similarity. To capture the object-level spatial context, we develop the Stacked Transformer, which individually characterizes the visual appearance of each candidate object and whose feature maps are directly decoded into binary mask sequences in temporal order. Finally, the model finds the best match between the mask sequences and the text query. In addition, to diversify the generated masks of the candidate objects, we impose a diversity loss on the model so as to capture a more accurate mask of the referred object. Empirical studies on three benchmarks show the superiority of the proposed method, e.g., FTEA achieves 45.1% and 38.7% mAP on A2D Sentences (3782 videos) and J-HMDB Sentences (928 videos), respectively, and 56.6% J&F on Ref-YouTube-VOS (3975 videos and 7451 objects). In particular, compared with the best competing method, it gains 2.1% and 3.2% in P@0.5 on the former two benchmarks, respectively, and 2.9% in J on the latter.
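The abstract mentions a diversity loss that encourages the candidate object masks to differ from one another. The paper's exact formulation is not reproduced in this record; the snippet below is only a minimal sketch of one common way such a term can be realized, penalizing pairwise cosine similarity between the flattened mask predictions of the candidate objects. The function name, tensor shapes, and the candidate count are illustrative assumptions, not FTEA's actual implementation.

import torch
import torch.nn.functional as F

def diversity_loss(mask_logits: torch.Tensor) -> torch.Tensor:
    # Illustrative diversity term (assumed, not the exact FTEA loss):
    # penalize pairwise cosine similarity between the flattened mask
    # predictions of the N candidate objects, so the candidates do not
    # collapse onto the same region.
    # mask_logits: (N, T, H, W) mask logits for N candidates over T frames.
    n = mask_logits.shape[0]
    if n < 2:
        return mask_logits.new_zeros(())               # no pairs to compare
    flat = F.normalize(mask_logits.reshape(n, -1), dim=-1)   # unit-norm rows
    sim = flat @ flat.t()                              # (N, N) cosine similarities
    off_diag = sim.masked_fill(
        torch.eye(n, dtype=torch.bool, device=sim.device), 0.0
    )
    return off_diag.sum() / (n * (n - 1))              # mean over candidate pairs

# Example: 5 candidate objects, 8 frames, 64x64 masks.
masks = torch.randn(5, 8, 64, 64)
loss = diversity_loss(masks)

In this sketch the term is minimized when the candidate masks are mutually dissimilar, which matches the stated goal of diversifying the generated masks; the weight with which it would be combined with the segmentation and matching losses is left unspecified here.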
Pages: 17
Related Papers
50 records in total
  • [1] Fully Transformer-Equipped Architecture for end-to-end Referring Video Object Segmentation
    Li, Ping
    Zhang, Yu
    Yuan, Li
    Xu, Xianghua
    Information Processing and Management, 2024, 61 (01):
  • [2] End-to-End Referring Video Object Segmentation with Multimodal Transformers
    Botach, Adam
    Zheltonozhskii, Evgenii
    Baskin, Chaim
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 4975 - 4985
  • [3] Fully and Weakly Supervised Referring Expression Segmentation With End-to-End Learning
    Li, Hui
    Sun, Mingjie
    Xiao, Jimin
    Lim, Eng Gee
    Zhao, Yao
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (10) : 5999 - 6012
  • [4] A Generative Appearance Model for End-to-end Video Object Segmentation
    Johnander, Joakim
    Danelljan, Martin
    Brissman, Emil
    Khan, Fahad Shahbaz
    Felsberg, Michael
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 8945 - 8954
  • [5] RVOS: End-to-End Recurrent Network for Video Object Segmentation
    Ventura, Carles
    Bellver, Miriam
    Girbau, Andreu
    Salvador, Amaia
    Marques, Ferran
    Giro-i-Nieto, Xavier
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 5272 - 5281
  • [6] FEELVOS: Fast End-to-End Embedding Learning for Video Object Segmentation
    Voigtlaender, Paul
    Chai, Yuning
    Schroff, Florian
    Adam, Hartwig
    Leibe, Bastian
    Chen, Liang-Chieh
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 9473 - 9482
  • [7] End-to-End Video Text Spotting with Transformer
    Wu, Weijia
    Cai, Yuanqiang
    Shen, Chunhua
    Zhang, Debing
    Fu, Ying
    Zhou, Hong
    Luo, Ping
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2024, 132 (09) : 4019 - 4035
  • [8] End-to-End Video Instance Segmentation with Transformers
    Wang, Yuqing
    Xu, Zhaoliang
    Wang, Xinlong
    Shen, Chunhua
    Cheng, Baoshan
    Shen, Hao
    Xia, Huaxia
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 8737 - 8746
  • [9] SRDD: a lightweight end-to-end object detection with transformer
    Zhu, Yuan
    Xia, Qingyuan
    Jin, Wen
    CONNECTION SCIENCE, 2022, 34 (01) : 2448 - 2465
  • [10] End-to-End Dense Video Captioning with Masked Transformer
    Zhou, Luowei
    Zhou, Yingbo
    Corso, Jason J.
    Socher, Richard
    Xiong, Caiming
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 8739 - 8748