Improving visual grounding with multi-modal interaction and auto-regressive vertex generation

Citations: 0
Authors
Qin, Xiaofei [1]
Li, Fan [1]
He, Changxiang [2]
Pei, Ruiqi [1]
Zhang, Xuedian [1,3,4,5]
Affiliations
[1] Univ Shanghai Sci & Technol, Sch Opt Elect & Comp Engn, Shanghai 200093, Peoples R China
[2] Univ Shanghai Sci & Technol, Coll Sci, Shanghai 200093, Peoples R China
[3] Shanghai Key Lab Modern Opt Syst, Shanghai 200093, Peoples R China
[4] Minist Educ, Key Lab Biomed Opt Technol & Devices, Shanghai 200093, Peoples R China
[5] Tongji Univ, Shanghai Inst Intelligent Sci & Technol, Shanghai 201210, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visual grounding; Multi-modal learning; Multi-task learning; Computer vision; Natural language processing
DOI
10.1016/j.neucom.2024.128227
CLC number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
We propose a concise and consistent network focusing on multi-task learning of Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES) within visual grounding (VG). To simplify the model architecture and achieve parameter sharing, we recast the visual grounding task as a floating-point coordinate generation problem conditioned on both image and text inputs. Consequently, rather than separately predicting bounding boxes and pixel-level segmentation masks, we represent both uniformly as a sequence of coordinate tokens and output the two corner points of bounding boxes and the polygon vertices of masks autoregressively. To improve the accuracy of point generation, we introduce a regression-based decoder. Inspired by bilinear interpolation, this decoder directly predicts precise floating-point coordinates, avoiding quantization errors. Additionally, we devise a Multi-Modal Interaction Fusion (M²IF) module to address the imbalance between visual and language features in the model. This module focuses visual information on regions relevant to the textual description while suppressing the influence of irrelevant areas. With this design, visual grounding is realized through a unified network structure. Experiments on five benchmark datasets (RefCOCO, RefCOCO+, RefCOCOg, ReferItGame and Flickr30K Entities) demonstrate that the proposed unified network outperforms or is on par with many existing task-customized models. Code is available at https://github.com/LFUSST/MMI-VG.
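To make the abstract's unified formulation concrete, the sketch below shows one plausible reading of it in PyTorch. It is illustrative only, not the authors' released implementation (see the GitHub link above for that): the names serialize_targets, RegressionHead, and num_bins are hypothetical, and interpreting the "bilinear-interpolation-inspired" decoder as a soft-argmax over coordinate bins is an assumption consistent with the claim of quantization-free, floating-point outputs.

```python
# Hypothetical sketch (not the authors' code): casting REC and RES as one
# autoregressive coordinate-generation problem, where a box contributes two
# corner points and a mask contributes polygon vertices, so both tasks
# share a single output space of coordinate tokens.

import torch
import torch.nn as nn


def serialize_targets(box, polygon):
    """Flatten a box (x1, y1, x2, y2) and polygon vertices [(x, y), ...]
    into one coordinate sequence, as in the unified formulation."""
    seq = list(box)              # two corner points -> 4 scalars
    for x, y in polygon:         # polygon vertices follow
        seq.extend([x, y])
    return torch.tensor(seq, dtype=torch.float32)


class RegressionHead(nn.Module):
    """Regression-based decoding head: instead of committing to a single
    discrete coordinate bin (which quantizes), take the expectation of the
    predicted distribution over bin centers, yielding a continuous
    floating-point coordinate in [0, 1]."""

    def __init__(self, hidden_dim, num_bins=64):
        super().__init__()
        self.to_logits = nn.Linear(hidden_dim, num_bins)
        # Bin centers evenly spaced in normalized image coordinates.
        self.register_buffer(
            "centers", (torch.arange(num_bins) + 0.5) / num_bins
        )

    def forward(self, hidden):   # hidden: (batch, seq, hidden_dim)
        probs = self.to_logits(hidden).softmax(dim=-1)
        # Expected coordinate = sum_i p_i * center_i -> float, no rounding.
        return probs @ self.centers   # (batch, seq)
```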
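The M²IF module is likewise described only at a high level in the abstract. The following is a hedged sketch of one way such text-guided fusion could be built, assuming a standard cross-attention-plus-gating design; TextGuidedFusion and its internals are invented for illustration and may differ from the paper's actual architecture.

```python
# Hypothetical sketch of M²IF-style fusion: let the referring expression
# attend over visual tokens, then gate the visual features so regions
# relevant to the text are emphasized and irrelevant ones suppressed.

import torch
import torch.nn as nn


class TextGuidedFusion(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, visual, text):
        # visual: (B, N_v, dim) image tokens; text: (B, N_t, dim) word tokens.
        attended, _ = self.cross_attn(query=visual, key=text, value=text)
        # Per-token gate in (0, 1): small values suppress regions the
        # expression does not mention, countering modality imbalance.
        return visual * self.gate(attended)
```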
Pages: 12
Related papers (50 in total; first 10 shown)
  • [1] Dynamic Multi-modal Prompting for Efficient Visual Grounding
    Wu, Wansen
    Liu, Ting
    Wang, Youkai
    Xu, Kai
    Yin, Quanjun
    Hu, Yue
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2023, PT VII, 2024, 14431: 359-371
  • [2] Multi-Modal Hallucination Control by Visual Information Grounding
    Favero, Alessandro
    Zancato, Luca
    Trager, Matthew
    Choudhary, Siddharth
    Perera, Pramuditha
    Achille, Alessandro
    Swaminathan, Ashwin
    Soatto, Stefano
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024: 14303-14312
  • [3] Multi-Modal Dynamic Graph Transformer for Visual Grounding
    Chen, Sijia
    Li, Baochun
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022: 15513-15522
  • [4] Probabilistic kernels for the classification of auto-regressive visual processes
    Chan, AB
    Vasconcelos, N
    2005 IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, VOL 1, PROCEEDINGS, 2005: 846-851
  • [5] Locally Hierarchical Auto-Regressive Modeling for Image Generation
    You, Tackgeun
    Kim, Saehoon
    Kim, Chiheon
    Lee, Doyup
    Han, Bohyung
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
  • [6] Revisit the Scalability of Deep Auto-Regressive Models for Graph Generation
    Yang, Shuai
    Shen, Xipeng
    Lim, Seung-Hwan
    2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021
  • [7] Practical generation of video textures using the auto-regressive process
    Campbell, N
    Dalton, C
    Gibson, D
    Oziem, D
    Thomas, B
    IMAGE AND VISION COMPUTING, 2004, 22(10): 819-827
  • [8] ReCell: replicating recurrent cell for auto-regressive pose generation
    Korzun, Vladislav
    Beloborodova, Anna
    Ilin, Arkady
    COMPANION PUBLICATION OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2022, 2022: 94-97
  • [9] Visual sign language recognition based on HMMs and Auto-regressive HMMs
    Yang, Xiaolin
    Jiang, Feng
    Liu, Han
    Yao, Hongxun
    Gao, Wen
    Wang, Chunli
    GESTURE IN HUMAN-COMPUTER INTERACTION AND SIMULATION, 2006, 3881: 80-83
  • [10] Multi-modal visual tracking based on textual generation
    Wang, Jiahao
    Liu, Fang
    Jiao, Licheng
    Wang, Hao
    Li, Shuo
    Li, Lingling
    Chen, Puhua
    Liu, Xu
    INFORMATION FUSION, 2024, 112