Fully Transformer-Equipped Architecture for end-to-end Referring Video Object Segmentation

Cited: 6
Authors
Li, Ping [1 ,2 ]
Zhang, Yu [1 ]
Yuan, Li [3 ]
Xu, Xianghua [1 ]
Affiliations
[1] Hangzhou Dianzi Univ, Sch Comp Sci & Technol, Hangzhou, Peoples R China
[2] Guangdong Lab Artificial Intelligence & Digital Ec, Shenzhen, Peoples R China
[3] Peking Univ, Sch Elect & Comp Engn, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video object segmentation; Stacked transformer; Diverse object mask; Vision-language alignment;
DOI
10.1016/j.ipm.2023.103566
CLC number
TP [Automation Technology, Computer Technology];
Subject classification code
0812;
Abstract
Referring Video Object Segmentation (RVOS) requires segmenting the object in a video referred to by a natural language query. Existing methods mainly rely on sophisticated pipelines to tackle this cross-modal task and do not explicitly model the object-level spatial context, which plays an important role in locating the referred object. We therefore propose an end-to-end RVOS framework built entirely upon transformers, termed Fully Transformer-Equipped Architecture (FTEA), which treats RVOS as a mask sequence learning problem and regards all objects in the video as candidate objects. Given a video clip and a text query, the encoder yields visual and textual features, and the corresponding pixel-level and word-level features are aligned by semantic similarity. To capture object-level spatial context, we develop the Stacked Transformer, which individually characterizes the visual appearance of each candidate object and directly decodes its feature map into an ordered binary mask sequence. Finally, the model finds the best match between the mask sequences and the text query. In addition, to diversify the masks generated for the candidate objects, we impose a diversity loss on the model so that it captures a more accurate mask of the referred object. Empirical studies show the superiority of the proposed method on three benchmarks: FTEA achieves 45.1% and 38.7% mAP on A2D Sentences (3782 videos) and J-HMDB Sentences (928 videos), respectively, and 56.6% J&F on Ref-YouTube-VOS (3975 videos and 7451 objects). In particular, compared to the best competing method, it gains 2.1% and 3.2% in P@0.5 on the former two benchmarks, respectively, and 2.9% in J on the latter.
Pages: 17
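The abstract does not give the exact form of the diversity loss, so the following is only a minimal sketch of the general idea under an assumed formulation: penalizing the pairwise soft IoU between candidate mask probability maps pushes the candidates to cover different objects. The function name diversity_loss and the soft-IoU choice here are illustrative assumptions, not the paper's definition.

import torch

def diversity_loss(masks: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Assumed formulation, not the paper's: mean pairwise soft IoU
    # between candidate masks, so minimizing it pushes masks apart.
    # masks: (N, H, W) probabilities in [0, 1], one map per candidate; N >= 2.
    n = masks.size(0)
    flat = masks.flatten(1)                       # (N, H*W)
    inter = flat @ flat.t()                       # soft pairwise intersections, (N, N)
    area = flat.sum(dim=1)                        # soft area of each mask, (N,)
    union = area[:, None] + area[None, :] - inter # soft pairwise unions, (N, N)
    iou = inter / (union + eps)                   # soft IoU for every candidate pair
    off_diag = iou - torch.diag(torch.diag(iou))  # zero out self-similarity
    return off_diag.sum() / (n * (n - 1))         # mean over ordered pairs

# Usage: 8 hypothetical candidate objects on a 64x64 feature grid.
masks = torch.rand(8, 64, 64)
loss = diversity_loss(masks)

Any formulation that rewards low overlap between candidate masks would serve the same purpose; soft IoU is chosen here only because it is scale-invariant with respect to mask area.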
Related Papers
50 records in total
  • [31] An End-to-End Video Coding Method via Adaptive Vision Transformer
    Yang, Haoyan
    Zhou, Mingliang
    Shang, Zhaowei
    Pu, Huayan
    Luo, Jun
    Huang, Xiaoxu
    Wang, Shilong
    Cao, Huajun
    Wei, Xuekai
    Xian, Weizhi
    INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2024, 38 (01)
  • [32] An End-to-End Human Segmentation by Region Proposed Fully Convolutional Network
    Jiang, Xiaoyan
    Gao, Yongbin
    Fang, Zhijun
    Wang, Peng
    Huang, Bo
    IEEE ACCESS, 2019, 7: 16395-16405
  • [33] End-to-End Video Object Detection with Spatial-Temporal Transformers
    He, Lu
    Zhou, Qianyu
    Li, Xiangtai
    Niu, Li
    Cheng, Guangliang
    Li, Xiao
    Liu, Wenxuan
    Tong, Yunhai
    Ma, Lizhuang
    Zhang, Liqing
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021: 1507-1516
  • [34] SwinDocSegmenter: An End-to-End Unified Domain Adaptive Transformer for Document Instance Segmentation
    Banerjee, Ayan
    Biswas, Sanket
    Lladós, Josep
    Pal, Umapada
    arXiv, 2023
  • [35] MCTE: MARRYING CONVOLUTION AND TRANSFORMER EFFICIENTLY FOR END-TO-END MEDICAL IMAGE SEGMENTATION
    Li, Jiuqiang
    2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023: 1100-1104
  • [36] DocSegTr: An Instance-Level End-to-End Document Image Segmentation Transformer
    Biswas, Sanket
    Banerjee, Ayan
    Lladós, Josep
    Pal, Umapada
    arXiv, 2022
  • [37] RESC: REfine the SCore with adaptive transformer head for end-to-end object detection
    Wang, Honglie
    Jiang, Rong
    Xu, Jian
    Sun, Shouqian
    NEURAL COMPUTING & APPLICATIONS, 2022, 34 (14): 12017-12028
  • [39] End-to-End Dual-Stream Transformer with a Parallel Encoder for Video Captioning
    Ran, Yuting
    Fang, Bin
    Chen, Lei
    Wei, Xuekai
    Xian, Weizhi
    Zhou, Mingliang
    JOURNAL OF CIRCUITS SYSTEMS AND COMPUTERS, 2024, 33 (04)
  • [40] Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems
    Le, Hung
    Sahoo, Doyen
    Chen, Nancy F.
    Hoi, Steven C. H.
    57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), 2019: 5612-5623