Fully Transformer-Equipped Architecture for end-to-end Referring Video Object Segmentation

Citations: 0
Authors
Li, Ping [1 ,2 ]
Zhang, Yu [1 ]
Yuan, Li [3 ]
Xu, Xianghua [1 ]
Affiliations
[1] School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
[2] Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China
[3] School of Electronic and Computer Engineering, Peking University, Beijing, China
Source
Information Processing and Management | 2024, Vol. 61, Issue 01
Funding
National Natural Science Foundation of China
Keywords
Motion compensation; Semantic segmentation
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Referring Video Object Segmentation (RVOS) requires segmenting the object in a video referred to by a natural language query. Existing methods mainly rely on sophisticated pipelines to tackle this cross-modal task, and they do not explicitly model the object-level spatial context, which plays an important role in locating the referred object. Therefore, we propose an end-to-end RVOS framework built entirely upon transformers, termed Fully Transformer-Equipped Architecture (FTEA), which treats the RVOS task as a mask sequence learning problem and regards all the objects in the video as candidate objects. Given a video clip with a text query, the visual–textual features are yielded by the encoder, and the corresponding pixel-level and word-level features are aligned in terms of semantic similarity. To capture the object-level spatial context, we develop the Stacked Transformer, which individually characterizes the visual appearance of each candidate object and whose feature map is directly decoded into an ordered binary mask sequence. Finally, the model finds the best matching between the mask sequence and the text query. In addition, to diversify the generated masks for candidate objects, we impose a diversity loss on the model to capture a more accurate mask of the referred object. Empirical studies have shown the superiority of the proposed method on three benchmarks, e.g., FTEA achieves 45.1% and 38.7% mAP on A2D Sentences (3782 videos) and J-HMDB Sentences (928 videos), respectively, and 56.6% J&F on Ref-YouTube-VOS (3975 videos and 7451 objects). In particular, compared to the best candidate method, it gains 2.1% and 3.2% in P@0.5 on the former two benchmarks, respectively, and 2.9% in J on the latter. © 2023 Elsevier Ltd
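The abstract mentions a diversity loss that pushes the generated candidate masks apart from one another, but the record does not give its exact formulation. The following is only a minimal PyTorch sketch of one plausible variant: a penalty on the mean pairwise cosine similarity between the flattened mask sequences of the candidate objects. The function name `diversity_loss`, the tensor layout (N, T, H, W), and the cosine-similarity form are illustrative assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def diversity_loss(candidate_masks: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Illustrative diversity penalty over candidate object mask sequences.

    candidate_masks: (N, T, H, W) mask logits for N candidate objects across
    T frames. Returns a scalar: the mean pairwise cosine similarity between
    the candidates' flattened spatio-temporal masks, so minimizing it drives
    the candidate masks apart (assumed form; not the paper's exact loss).
    """
    n = candidate_masks.size(0)
    # Flatten each candidate's mask sequence into a single probability vector.
    probs = torch.sigmoid(candidate_masks).flatten(start_dim=1)   # (N, T*H*W)
    probs = F.normalize(probs, p=2, dim=1, eps=eps)               # unit-length rows
    sim = probs @ probs.t()                                       # (N, N) cosine similarities
    # Zero out self-similarity on the diagonal, then average the off-diagonal terms.
    off_diag = sim - torch.diag(torch.diagonal(sim))
    return off_diag.sum() / (n * (n - 1))


if __name__ == "__main__":
    masks = torch.randn(5, 8, 64, 64)   # 5 candidates, 8 frames, 64x64 masks
    print(diversity_loss(masks))
```

Under these assumptions, the term would be added to the segmentation and matching objectives with a weighting coefficient chosen on a validation set.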
Related Papers
50 records in total
  • [41] End-to-End Video Captioning
    Olivastri, Silvio
    Singh, Gurkirt
    Cuzzolin, Fabio
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW), 2019, : 1474 - 1482
  • [42] SymFormer: End-to-End Symbolic Regression Using Transformer-Based Architecture
    Vastl, Martin
    Kulhanek, Jonas
    Kubalik, Jiri
    Derner, Erik
    Babuska, Robert
    IEEE ACCESS, 2024, 12 : 37840 - 37849
  • [43] SFormer: An end-to-end spatio-temporal transformer architecture for deepfake detection
    Kingra, Staffy
    Aggarwal, Naveen
    Kaur, Nirmal
    FORENSIC SCIENCE INTERNATIONAL-DIGITAL INVESTIGATION, 2024, 51
  • [44] NeuralREG: An end-to-end approach to referring expression generation
    Ferreira, Thiago Castro
    Moussallem, Diego
    Kadar, Akos
    Wubben, Sander
    Krahmer, Emiel
    PROCEEDINGS OF THE 56TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL), VOL 1, 2018, : 1959 - 1969
  • [45] RETR: End-to-End Referring Expression Comprehension with Transformers
    Rui, Yang
    2022 19TH INTERNATIONAL COMPUTER CONFERENCE ON WAVELET ACTIVE MEDIA TECHNOLOGY AND INFORMATION PROCESSING (ICCWAMTIP), 2022,
  • [46] Video Object Segmentation with Referring Expressions
    Khoreva, Anna
    Rohrbach, Anna
    Schiele, Bernt
    COMPUTER VISION - ECCV 2018 WORKSHOPS, PT IV, 2019, 11132 : 7 - 12
  • [47] Methods for Referring Video Object Segmentation
    Wei, Caiying
    Jia, Lei
    Computer Engineering and Applications, 61 (02): 73 - 83
  • [48] MPNET: An End-to-End Deep Neural Network for Object Detection in Surveillance Video
    Wang, Hanyu
    Wang, Ping
    Qian, Xueming
    IEEE ACCESS, 2018, 6 : 30296 - 30308
  • [49] TransVOD: End-to-End Video Object Detection With Spatial-Temporal Transformers
    Zhou, Qianyu
    Li, Xiangtai
    He, Lu
    Yang, Yibo
    Cheng, Guangliang
    Tong, Yunhai
    Ma, Lizhuang
    Tao, Dacheng
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (06) : 7853 - 7869
  • [50] AN END-TO-END ARCHITECTURE FOR CLASS-INCREMENTAL OBJECT DETECTION WITH KNOWLEDGE DISTILLATION
    Hao, Yu
    Fu, Yanwei
    Jiang, Yu-Gang
    Tian, Qi
    2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2019, : 1 - 6