Improving visual grounding with multi-modal interaction and auto-regressive vertex generation

Times Cited: 0
|
Authors
Qin, Xiaofei [1 ]
Li, Fan [1 ]
He, Changxiang [2 ]
Pei, Ruiqi [1 ]
Zhang, Xuedian [1 ,3 ,4 ,5 ]
Affiliations
[1] Univ Shanghai Sci & Technol, Sch Opt Elect & Comp Engn, Shanghai 200093, Peoples R China
[2] Univ Shanghai Sci & Technol, Coll Sci, Shanghai 200093, Peoples R China
[3] Shanghai Key Lab Modern Opt Syst, Shanghai 200093, Peoples R China
[4] Minist Educ, Key Lab Biomed Opt Technol & Devices, Shanghai 200093, Peoples R China
[5] Tongji Univ, Shanghai Inst Intelligent Sci & Technol, Shanghai 201210, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visual grounding; Multi-modal learning; Multi-task learning; Computer vision; Natural language processing;
DOI
10.1016/j.neucom.2024.128227
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
We propose a concise and consistent network focusing on multi-task learning of Referring Expression Comprehension (REC) and Segmentation (RES) within visual grounding (VG). To simplify the model architecture and achieve parameter sharing, we recast the visual grounding task as a floating-point coordinate generation problem conditioned on both image and text inputs. Consequently, rather than separately predicting bounding boxes and pixel-level segmentation masks, we represent both uniformly as a sequence of coordinate tokens and output the two corner points of bounding boxes and polygon vertices autoregressively. To improve the accuracy of point generation, we introduce a regression-based decoder. Inspired by bilinear interpolation, this decoder can directly predict precise floating-point coordinates, thus avoiding quantization errors. Additionally, we devise a Multi-Modal Interaction Fusion (M²IF) module to address the imbalance between visual and language features in the model. This module focuses visual information on regions relevant to the textual description while suppressing the influence of irrelevant areas. Based on our model, visual grounding is realized through a unified network structure. Experiments conducted on five benchmark datasets (RefCOCO, RefCOCO+, RefCOCOg, ReferItGame and Flickr30K Entities) demonstrate that the proposed unified network outperforms or is on par with many existing task-customized models. Code is available at https://github.com/LFUSST/MMI-VG.
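The abstract's central technical idea is that coordinate tokens are decoded to continuous values instead of being snapped to a discrete bin grid. The PyTorch snippet below is a purely illustrative sketch, not the released MMI-VG code: it shows one common way a regression-based head can produce floating-point coordinates by taking an expectation over evenly spaced bin centres (a linear-interpolation-style readout), which avoids quantization error. The class name SoftCoordinateHead and the hidden_dim/num_bins values are assumptions made for the example.

```python
# Hypothetical sketch of a regression-based coordinate head (not the authors' code).
import torch
import torch.nn as nn


class SoftCoordinateHead(nn.Module):
    """Maps a decoder hidden state to a floating-point coordinate in [0, 1]."""

    def __init__(self, hidden_dim: int = 256, num_bins: int = 1000):
        super().__init__()
        self.to_logits = nn.Linear(hidden_dim, num_bins)
        # Fixed, evenly spaced bin centres covering the normalized image extent.
        self.register_buffer("bin_centers", torch.linspace(0.0, 1.0, num_bins))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, hidden_dim) -> probability distribution over coordinate bins.
        probs = self.to_logits(hidden).softmax(dim=-1)
        # Expectation over bin centres yields a continuous coordinate,
        # so the prediction is not quantized to the nearest bin.
        return (probs * self.bin_centers).sum(dim=-1)


if __name__ == "__main__":
    head = SoftCoordinateHead()
    # Four decoder states, e.g. the x/y tokens of two bounding-box corners.
    coords = head(torch.randn(4, 256))
    print(coords)  # four floating-point values in [0, 1]
```

In an autoregressive setup of the kind the abstract describes, such a head would be applied at each decoding step, with the predicted coordinate fed back as input for the next token; that wiring is omitted here for brevity.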
Pages: 12