Local-Global Context Aware Transformer for Language-Guided Video Segmentation

Cited by: 44

Authors:

Liang, Chen [1]

Wang, Wenguan [1]

Zhou, Tianfei [2]

Miao, Jiaxu [1]

Luo, Yawei [1]

Yang, Yi [1]

Affiliations:

[1] Zhejiang Univ, ReLER, CCAI, Hangzhou 310027, Zhejiang, Peoples R China

[2] Swiss Fed Inst Technol, CH-8092 Zurich, Switzerland

Source:

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE | 2023 / Vol. 45 / No. 08

Funding:

National Key Research and Development Program of China;

Keywords:

Transformers; Task analysis; Visualization; Three-dimensional displays; Linguistics; Object segmentation; Grounding; Language-guided video segmentation; memory network; multi-modal transformer;

DOI:

10.1109/TPAMI.2023.3262578

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We explore the task of language-guided video segmentation (LVS). Previous algorithms mostly adopt 3D CNNs to learn video representation, struggling to capture long-term context and easily suffering from visual-linguistic misalignment. In light of this, we present LOCATER (local-global context aware Transformer), which augments the Transformer architecture with a finite memory so as to query the entire video with the language expression in an efficient manner. The memory is designed to involve two components - one for persistently preserving global video content, and one for dynamically gathering local temporal context and segmentation history. Based on the memorized local-global context and the particular content of each frame, LOCATER holistically and flexibly comprehends the expression as an adaptive query vector for each frame. The vector is used to query the corresponding frame for mask generation. The memory also allows LOCATER to process videos with linear time complexity and constant size memory, while Transformer-style self-attention computation scales quadratically with sequence length. To thoroughly examine the visual grounding capability of LVS models, we contribute a new LVS dataset, A2D-S+, which is built upon A2D-S dataset but poses increased challenges in disambiguating among similar objects. Experiments on three LVS datasets and our A2D-S+ show that LOCATER outperforms previous state-of-the-arts. Further, we won the 1st place in the Referring Video Object Segmentation Track of the 3rd Large-scale Video Object Segmentation Challenge, where LOCATER served as the foundation for the winning solution.

Pages: 10055-10069

Page count: 15

