Fully Transformer-Equipped Architecture for End-to-End Referring Video Object Segmentation

Cited: 0
Authors
Li, Ping [1 ,2 ]
Zhang, Yu [1 ]
Yuan, Li [3 ]
Xu, Xianghua [1 ]
Affiliations
[1] School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
[2] Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China
[3] School of Electronic and Computer Engineering, Peking University, Beijing, China
Source
Information Processing and Management | 2024, Vol. 61, No. 01
Funding
National Natural Science Foundation of China
Keywords
Motion compensation; Semantic segmentation
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Referring Video Object Segmentation (RVOS) requires segmenting the object in a video referred to by a natural language query. Existing methods mainly rely on sophisticated pipelines to tackle this cross-modal task, and do not explicitly model the object-level spatial context, which plays an important role in locating the referred object. Therefore, we propose an end-to-end RVOS framework built entirely upon transformers, termed Fully Transformer-Equipped Architecture (FTEA), which treats the RVOS task as a mask sequence learning problem and regards all objects in the video as candidate objects. Given a video clip with a text query, the encoder yields visual and textual features, and the corresponding pixel-level and word-level features are aligned in terms of semantic similarity. To capture the object-level spatial context, we develop the Stacked Transformer, which individually characterizes the visual appearance of each candidate object; its feature map is directly decoded into a binary mask sequence in frame order. Finally, the model finds the best matching between the mask sequences and the text query. In addition, to diversify the generated masks of the candidate objects, we impose a diversity loss on the model, which helps capture a more accurate mask of the referred object. Empirical studies show the superiority of the proposed method on three benchmarks: FTEA achieves 45.1% and 38.7% mAP on A2D Sentences (3782 videos) and J-HMDB Sentences (928 videos), respectively, and 56.6% J&F on Ref-YouTube-VOS (3975 videos and 7451 objects). In particular, compared to the best competing method, it gains 2.1% and 3.2% in P@0.5 on the former two benchmarks, respectively, and 2.9% in J on the latter. © 2023 Elsevier Ltd
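Note: the abstract states that a diversity loss is imposed to diversify the candidate object masks, but does not give its exact form. The PyTorch sketch below is one minimal, illustrative reading, assuming a pairwise cosine-similarity penalty over the candidates' predicted mask sequences; the function name diversity_loss and the (N, T, H, W) tensor layout are assumptions, not the paper's implementation.

    import torch

    def diversity_loss(mask_logits: torch.Tensor) -> torch.Tensor:
        # mask_logits: (N, T, H, W) logits for N candidate objects over T frames.
        # Returns a scalar that shrinks as the candidates' masks grow dissimilar.
        n = mask_logits.size(0)
        if n < 2:  # diversity is undefined for a single candidate
            return mask_logits.new_zeros(())
        probs = mask_logits.sigmoid().flatten(1)             # (N, T*H*W) mask probabilities
        probs = torch.nn.functional.normalize(probs, dim=1)  # unit length per candidate
        sim = probs @ probs.t()                              # (N, N) cosine similarities
        off_diag = ~torch.eye(n, dtype=torch.bool, device=sim.device)
        return sim[off_diag].mean()                          # mean over the N*(N-1) candidate pairs

    # Usage: 8 candidate objects, 5 frames, 64x64 masks.
    loss = diversity_loss(torch.randn(8, 5, 64, 64))

Minimizing such a term pushes the candidate masks apart, so distinct candidates cover distinct objects and the best-matching candidate aligns more precisely with the referred object.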
Related Papers
50 items in total
  • [21] RQFormer: Rotated Query Transformer for end-to-end oriented object detection
    Zhao, Jiaqi
    Ding, Zeyu
    Zhou, Yong
    Zhu, Hancheng
    Du, Wen-Liang
    Yao, Rui
    El Saddik, Abdulmotaleb
    EXPERT SYSTEMS WITH APPLICATIONS, 2025, 266
  • [22] Transformer-based End-to-End Object Detection in Aerial Images
    Vo, Nguyen D.
    Le, Nguyen
    Ngo, Giang
    Doan, Du
    Le, Do
    Nguyen, Khang
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2023, 14 (10) : 1072 - 1079
  • [23] V-DETR: Pure Transformer for End-to-End Object Detection
    Nguyen, Dung
    Hoang, Van-Dung
    Le, Van-Tuong-Lan
    INTELLIGENT INFORMATION AND DATABASE SYSTEMS, PT II, ACIIDS 2024, 2024, 14796 : 120 - 131
  • [24] An End-to-End Transformer Model for 3D Object Detection
    Misra, Ishan
    Girdhar, Rohit
    Joulin, Armand
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 2886 - 2897
  • [25] DeoT: an end-to-end encoder-only Transformer object detector
    Ding, Tonghe
    Feng, Kaili
    Wei, Yanjun
    Han, Yu
    Li, Tianping
    JOURNAL OF REAL-TIME IMAGE PROCESSING, 2023, 20
  • [26] E2-VOR: An End-to-End En/Decoder Architecture for Efficient Video Object Recognition
    Song, Zhuoran
    Jing, Naifeng
    Liang, Xiaoyao
    ACM TRANSACTIONS ON DESIGN AUTOMATION OF ELECTRONIC SYSTEMS, 2023, 28 (01)
  • [27] Expression Prompt Collaboration Transformer for universal referring video object segmentation
    Chen, Jiajun
    Lin, Jiacheng
    Zhong, Guojin
    Fu, Haolong
    Nai, Ke
    Yang, Kailun
    Li, Zhiyong
    KNOWLEDGE-BASED SYSTEMS, 2025, 311
  • [28] Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer
    Peng, Min
    Wang, Chongyang
    Shi, Yu
    Zhou, Xiang-Dong
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 2, 2023, : 2038 - 2046
  • [29] End-to-End Video Scene Graph Generation With Temporal Propagation Transformer
    Zhang, Yong
    Pan, Yingwei
    Yao, Ting
    Huang, Rui
    Mei, Tao
    Chen, Chang-Wen
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 1613 - 1625
  • [30] Decoupled Cross-Modal Transformer for Referring Video Object Segmentation
    Wu, Ao
    Wang, Rong
    Tan, Quange
    Song, Zhenfeng
    SENSORS, 2024, 24 (16)