Improving visual grounding with multi-modal interaction and auto-regressive vertex generation

Cited by: 0
Authors
Qin, Xiaofei [1]
Li, Fan [1]
He, Changxiang [2]
Pei, Ruiqi [1]
Zhang, Xuedian [1,3,4,5]
Affiliations
[1] Univ Shanghai Sci & Technol, Sch Opt Elect & Comp Engn, Shanghai 200093, Peoples R China
[2] Univ Shanghai Sci & Technol, Coll Sci, Shanghai 200093, Peoples R China
[3] Shanghai Key Lab Modern Opt Syst, Shanghai 200093, Peoples R China
[4] Minist Educ, Key Lab Biomed Opt Technol & Devices, Shanghai 200093, Peoples R China
[5] Tongji Univ, Shanghai Inst Intelligent Sci & Technol, Shanghai 201210, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visual grounding; Multi-modal learning; Multi-task learning; Computer vision; Natural language processing;
DOI
10.1016/j.neucom.2024.128227
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
We propose a concise and consistent network focusing on multi-task learning of Referring Expression Comprehension (REC) and Segmentation (RES) within visual grounding (VG). To simplify the model architecture and achieve parameter sharing, we reformulate visual grounding as a floating-point coordinate generation problem conditioned on both image and text inputs. Consequently, rather than separately predicting bounding boxes and pixel-level segmentation masks, we represent both uniformly as a sequence of coordinate tokens and autoregressively output the two corner points of the bounding box and the polygon vertices. To improve the accuracy of point generation, we introduce a regression-based decoder. Inspired by bilinear interpolation, this decoder directly predicts precise floating-point coordinates, thus avoiding quantization errors. Additionally, we devise a Multi-Modal Interaction Fusion (M2IF) module to address the imbalance between visual and language features in the model. This module focuses visual information on regions relevant to the textual description while suppressing the influence of irrelevant areas. Based on our model, visual grounding is realized through a unified network structure. Experiments conducted on five benchmark datasets (RefCOCO, RefCOCO+, RefCOCOg, ReferItGame and Flickr30K Entities) demonstrate that the proposed unified network outperforms or is on par with many existing task-customized models. Code is available at https://github.com/LFUSST/MMI-VG.
Pages: 12
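The abstract describes the decoder only at a high level. The PyTorch sketch below illustrates one plausible reading of the core idea, stated as an assumption rather than the authors' method: coordinate tokens are generated autoregressively, and each step regresses a floating-point (x, y) pair as a softmax-weighted expectation over bin centres (a soft-argmax, in the spirit of interpolation between discrete positions), which avoids quantizing coordinates to a discrete vocabulary. All names here (CoordRegressionDecoder, n_bins, the 18-point default) are illustrative; the actual implementation is in the linked repository.

# Hypothetical sketch (not the released code): autoregressive generation of
# floating-point coordinate tokens.  A transformer decoder attends to fused
# image-text features; each step regresses an (x, y) pair as a softmax-weighted
# expectation over a grid of bin centres, so the output stays continuous.
import torch
import torch.nn as nn


class CoordRegressionDecoder(nn.Module):
    def __init__(self, d_model=256, n_bins=64, n_layers=3, n_heads=8):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.coord_embed = nn.Linear(2, d_model)          # embed previous (x, y)
        self.to_logits = nn.Linear(d_model, 2 * n_bins)   # per-axis bin scores
        # bin centres in [0, 1]; an expectation over them yields a float coordinate
        self.register_buffer("centers", (torch.arange(n_bins) + 0.5) / n_bins)

    def step(self, prev_coords, memory):
        # prev_coords: (B, T, 2) coordinates generated so far
        # memory:      (B, N, d_model) fused visual-linguistic features
        tgt = self.coord_embed(prev_coords)
        h = self.decoder(tgt, memory)[:, -1]              # last-step hidden state
        logits = self.to_logits(h).view(-1, 2, self.centers.numel())
        probs = logits.softmax(dim=-1)
        return probs @ self.centers                       # (B, 2) float (x, y)

    @torch.no_grad()
    def generate(self, memory, n_points=18):
        # first two points ~ bounding-box corners, the rest ~ polygon vertices
        coords = torch.full((memory.size(0), 1, 2), 0.5, device=memory.device)
        for _ in range(n_points):
            nxt = self.step(coords, memory)
            coords = torch.cat([coords, nxt.unsqueeze(1)], dim=1)
        return coords[:, 1:]                              # drop the start token


if __name__ == "__main__":
    memory = torch.randn(2, 100, 256)                     # stand-in fused features
    print(CoordRegressionDecoder().generate(memory).shape)  # torch.Size([2, 18, 2])

In this reading, the expectation over bin centres plays the role the abstract attributes to bilinear interpolation: the predicted coordinate can fall anywhere between grid positions, so no quantization error is introduced when converting model outputs back to box corners or polygon vertices.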