Fully Transformer-Equipped Architecture for end-to-end Referring Video Object Segmentation

Citations: 0
Authors
Li, Ping [1 ,2 ]
Zhang, Yu [1 ]
Yuan, Li [3 ]
Xu, Xianghua [1 ]
Affiliations
[1] School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
[2] Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China
[3] School of Electronic and Computer Engineering, Peking University, Beijing, China
Source
Information Processing and Management | 2024, Vol. 61, Issue 01
Funding
National Natural Science Foundation of China
Keywords
Motion compensation; Semantic segmentation
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Referring Video Object Segmentation (RVOS) requires segmenting the object in a video referred to by a natural language query. Existing methods mainly rely on sophisticated pipelines to tackle this cross-modal task, and they do not explicitly model the object-level spatial context, which plays an important role in locating the referred object. Therefore, we propose an end-to-end RVOS framework built entirely upon transformers, termed Fully Transformer-Equipped Architecture (FTEA), which treats the RVOS task as a mask sequence learning problem and regards all the objects in the video as candidate objects. Given a video clip with a text query, the visual–textual features are yielded by the encoder, and the corresponding pixel-level and word-level features are aligned in terms of semantic similarity. To capture the object-level spatial context, we develop the Stacked Transformer, which individually characterizes the visual appearance of each candidate object and whose feature map is directly decoded into an ordered binary mask sequence. Finally, the model finds the best matching between the mask sequence and the text query. In addition, to diversify the generated masks for candidate objects, we impose a diversity loss on the model to capture a more accurate mask of the referred object. Empirical studies have shown the superiority of the proposed method on three benchmarks, e.g., FTEA achieves 45.1% and 38.7% mAP on A2D Sentences (3782 videos) and J-HMDB Sentences (928 videos), respectively, and 56.6% J&F on Ref-YouTube-VOS (3975 videos and 7451 objects). In particular, compared to the best candidate method, it gains 2.1% and 3.2% in P@0.5 on the former two benchmarks, respectively, and 2.9% in J on the latter. © 2023 Elsevier Ltd
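The abstract mentions a diversity loss that pushes the generated candidate masks apart from one another, but the record does not give its exact formulation. The following is only a minimal PyTorch sketch of one plausible variant: a penalty on the mean pairwise cosine similarity between the flattened mask sequences of the candidate objects. The function name `diversity_loss`, the tensor layout (N, T, H, W), and the cosine-similarity form are illustrative assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def diversity_loss(candidate_masks: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Illustrative diversity penalty over candidate object mask sequences.

    candidate_masks: (N, T, H, W) mask logits for N candidate objects across
    T frames. Returns a scalar: the mean pairwise cosine similarity between
    the candidates' flattened spatio-temporal masks, so minimizing it drives
    the candidate masks apart (assumed form; not the paper's exact loss).
    """
    n = candidate_masks.size(0)
    # Flatten each candidate's mask sequence into a single probability vector.
    probs = torch.sigmoid(candidate_masks).flatten(start_dim=1)   # (N, T*H*W)
    probs = F.normalize(probs, p=2, dim=1, eps=eps)               # unit-length rows
    sim = probs @ probs.t()                                       # (N, N) cosine similarities
    # Zero out self-similarity on the diagonal, then average the off-diagonal terms.
    off_diag = sim - torch.diag(torch.diagonal(sim))
    return off_diag.sum() / (n * (n - 1))


if __name__ == "__main__":
    masks = torch.randn(5, 8, 64, 64)   # 5 candidates, 8 frames, 64x64 masks
    print(diversity_loss(masks))
```

Under these assumptions, the term would be added to the segmentation and matching objectives with a weighting coefficient chosen on a validation set.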
Related Papers
50 records in total
  • [41] End-to-End Video Captioning
    Olivastri, Silvio
    Singh, Gurkirt
    Cuzzolin, Fabio
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW), 2019, : 1474 - 1482
  • [42] SymFormer: End-to-End Symbolic Regression Using Transformer-Based Architecture
    Vastl, Martin
    Kulhanek, Jonas
    Kubalik, Jiri
    Derner, Erik
    Babuska, Robert
    IEEE ACCESS, 2024, 12 : 37840 - 37849
  • [43] SFormer: An end-to-end spatio-temporal transformer architecture for deepfake detection
    Kingra, Staffy
    Aggarwal, Naveen
    Kaur, Nirmal
    FORENSIC SCIENCE INTERNATIONAL-DIGITAL INVESTIGATION, 2024, 51
  • [44] NeuralREG: An end-to-end approach to referring expression generation
    Ferreira, Thiago Castro
    Moussallem, Diego
    Kadar, Akos
    Wubben, Sander
    Krahmer, Emiel
    PROCEEDINGS OF THE 56TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL), VOL 1, 2018, : 1959 - 1969
  • [45] RETR: End-to-End Referring Expression Comprehension with Transformers
    Rui, Yang
    2022 19TH INTERNATIONAL COMPUTER CONFERENCE ON WAVELET ACTIVE MEDIA TECHNOLOGY AND INFORMATION PROCESSING (ICCWAMTIP), 2022,
  • [46] Video Object Segmentation with Referring Expressions
    Khoreva, Anna
    Rohrbach, Anna
    Schiele, Bernt
    COMPUTER VISION - ECCV 2018 WORKSHOPS, PT IV, 2019, 11132 : 7 - 12
  • [47] Methods for Referring Video Object Segmentation
    Wei, Caiying
    Jia, Lei
    Computer Engineering and Applications, 61 (02): 73 - 83
  • [48] MPNET: An End-to-End Deep Neural Network for Object Detection in Surveillance Video
    Wang, Hanyu
    Wang, Ping
    Qian, Xueming
    IEEE ACCESS, 2018, 6 : 30296 - 30308
  • [49] TransVOD: End-to-End Video Object Detection With Spatial-Temporal Transformers
    Zhou, Qianyu
    Li, Xiangtai
    He, Lu
    Yang, Yibo
    Cheng, Guangliang
    Tong, Yunhai
    Ma, Lizhuang
    Tao, Dacheng
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (06) : 7853 - 7869
  • [50] AN END-TO-END ARCHITECTURE FOR CLASS-INCREMENTAL OBJECT DETECTION WITH KNOWLEDGE DISTILLATION
    Hao, Yu
    Fu, Yanwei
    Jiang, Yu-Gang
    Tian, Qi
    2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2019, : 1 - 6