Local-Global Context Aware Transformer for Language-Guided Video Segmentation

Cited by: 44
Authors
Liang, Chen [1 ]
Wang, Wenguan [1 ]
Zhou, Tianfei [2 ]
Miao, Jiaxu [1 ]
Luo, Yawei [1 ]
Yang, Yi [1 ]
Affiliations
[1] Zhejiang Univ, ReLER, CCAI, Hangzhou 310027, Zhejiang, Peoples R China
[2] Swiss Fed Inst Technol, CH-8092 Zurich, Switzerland
Funding
National Key R&D Program of China
Keywords
Transformers; Task analysis; Visualization; Three-dimensional displays; Linguistics; Object segmentation; Grounding; Language-guided video segmentation; memory network; multi-modal transformer
DOI
10.1109/TPAMI.2023.3262578
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
We explore the task of language-guided video segmentation (LVS). Previous algorithms mostly adopt 3D CNNs to learn video representations; they struggle to capture long-term context and are prone to visual-linguistic misalignment. In light of this, we present LOCATER (local-global context aware Transformer), which augments the Transformer architecture with a finite memory so as to query the entire video with the language expression in an efficient manner. The memory is designed with two components: one persistently preserves global video content, and the other dynamically gathers local temporal context and segmentation history. Based on the memorized local-global context and the particular content of each frame, LOCATER holistically and flexibly comprehends the expression as an adaptive query vector for each frame, which is then used to query the corresponding frame for mask generation. The memory also allows LOCATER to process videos in linear time with constant-size memory, whereas Transformer-style self-attention scales quadratically with sequence length. To thoroughly examine the visual grounding capability of LVS models, we contribute a new LVS dataset, A2D-S+, which is built upon the A2D-S dataset but poses increased challenges in disambiguating among similar objects. Experiments on three LVS datasets and our A2D-S+ show that LOCATER outperforms previous state-of-the-art methods. Further, we won first place in the Referring Video Object Segmentation track of the 3rd Large-scale Video Object Segmentation Challenge, where LOCATER served as the foundation for the winning solution.
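The abstract's central mechanism, a fixed-size local-global memory that turns the expression into a per-frame adaptive query, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the module name, dimensions, the mean-pooled local-memory update, and the dot-product mask head are all simplifying assumptions; the sketch only shows why per-frame cost stays bounded by the memory size, which yields the linear-in-length complexity the abstract claims.

# A minimal sketch (not the authors' code) of memory-augmented per-frame
# querying. All names and update rules here are hypothetical.
import torch
import torch.nn as nn

class LocalGlobalMemoryQuery(nn.Module):
    def __init__(self, dim=256, global_slots=8, local_slots=4):
        super().__init__()
        # Global memory: a fixed number of slots that persistently
        # summarize video-level content (here simply learned parameters).
        self.global_mem = nn.Parameter(torch.randn(global_slots, dim))
        self.local_slots = local_slots  # rolling window of recent frames
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.query_proj = nn.Linear(dim, dim)

    def forward(self, lang_feat, frame_feats):
        # lang_feat:   (B, D)        pooled expression embedding
        # frame_feats: (B, T, N, D)  N visual tokens per frame
        B, T, N, D = frame_feats.shape
        local_mem, logits = [], []
        for t in range(T):
            frame = frame_feats[:, t]                            # (B, N, D)
            ctx = [self.global_mem.unsqueeze(0).expand(B, -1, -1)]
            if local_mem:                                        # local temporal context
                ctx.append(torch.stack(local_mem, dim=1))
            ctx = torch.cat(ctx + [frame], dim=1)                # constant-size context
            # Adapt the language query to this frame's local-global context.
            q, _ = self.attn(lang_feat.unsqueeze(1), ctx, ctx)   # (B, 1, D)
            q = self.query_proj(q)
            # Query the frame tokens: per-token logits for mask generation.
            logits.append((frame @ q.transpose(1, 2)).squeeze(-1))
            # Update local memory with a frame summary; keep its size bounded.
            local_mem.append(frame.mean(dim=1))
            local_mem = local_mem[-self.local_slots:]
        return torch.stack(logits, dim=1)                        # (B, T, N)

model = LocalGlobalMemoryQuery()
lang = torch.randn(2, 256)                 # batch of 2 expressions
frames = torch.randn(2, 10, 196, 256)      # 10 frames, 14x14 tokens each
print(model(lang, frames).shape)           # torch.Size([2, 10, 196])

Because the memory buffers never grow with the number of frames, each frame attends over a bounded context, so total cost is linear in video length rather than quadratic as in full self-attention over all frames.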
Pages
10055-10069 (15 pages)
Related Papers (50 total)
  • [1] Towards Global Video Scene Segmentation with Context-Aware Transformer
    Yang, Yang
    Huang, Yurui
    Guo, Weili
    Xu, Baohua
    Xia, Dingyin
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 3, 2023, : 3206 - 3213
  • [2] DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting
    Rao, Yongming
    Zhao, Wenliang
    Chen, Guangyi
    Tang, Yansong
    Zhu, Zheng
    Huang, Guan
    Zhou, Jie
    Lu, Jiwen
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 18061 - 18070
  • [3] CLUE: Contrastive language-guided learning for referring video object segmentation
    Gao, Qiqi
    Zhong, Wanjun
    Li, Jie
    Zhao, Tiejun
    PATTERN RECOGNITION LETTERS, 2024, 178 : 115 - 121
  • [4] Local-Global Transformer Neural Network for temporal action segmentation
    Tian, Xiaoyan
    Jin, Ye
    Tang, Xianglong
    MULTIMEDIA SYSTEMS, 2023, 29 (02) : 615 - 626
  • [5] Video Diffusion Models with Local-Global Context Guidance
    Yang, Siyuan
    Zhang, Lu
    Liu, Yu
    Jiang, Zhizhuo
    He, You
    PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023, : 1640 - 1648
  • [6] DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition
    Liang, Yuxuan
    Zhou, Pan
    Zimmermann, Roger
    Yan, Shuicheng
    COMPUTER VISION, ECCV 2022, PT XXXIV, 2022, 13694 : 577 - 595
  • [7] SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation
    Ouyang, Shuyi
    Wang, Hongyi
    Xie, Shiao
    Niu, Ziwei
    Tong, Ruofeng
    Chen, Yen-Wei
    Lin, Lanfen
    PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023, : 1294 - 1302
  • [8] CLIP-It! Language-Guided Video Summarization
    Narasimhan, Medhini
    Rohrbach, Anna
    Darrell, Trevor
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [9] Hybrid Local-Global Context Learning for Neural Video Compression
    Zhai, Yongqi
    Yang, Jiayu
    Jiang, Wei
    Yang, Chunhui
    Tang, Luyang
    Wang, Ronggang
    2024 DATA COMPRESSION CONFERENCE, DCC, 2024, : 322 - 331
  • [10] mmFilter: Language-Guided Video Analytics at the Edge
    Hu, Zhiming
    Ye, Ning
    Phillips, Caleb
    Capes, Tim
    Mohomed, Iqbal
    PROCEEDINGS OF THE 2020 21ST INTERNATIONAL MIDDLEWARE CONFERENCE INDUSTRIAL TRACK (MIDDLEWARE INDUSTRY '20), 2020, : 1 - 7