Visual-Semantic Graph Matching for Visual Grounding

Cited by: 15
Authors
Jing, Chenchen [1 ]
Wu, Yuwei [1 ]
Pei, Mingtao [1 ]
Hu, Yao [2 ]
Jia, Yunde [1 ]
Wu, Qi [3 ]
Affiliations
[1] Beijing Inst Technol, Beijing, Peoples R China
[2] Alibaba Youku Cognit & Intelligent Lab, Beijing, Peoples R China
[3] Univ Adelaide, Adelaide, SA, Australia
Keywords
Visual Grounding; Graph Matching; Visual Scene Graph; Language Scene Graph; LANGUAGE;
DOI
10.1145/3394171.3413902
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual grounding is the task of associating entities in a natural language sentence with objects in an image. In this paper, we formulate visual grounding as a graph matching problem: finding node correspondences between a visual scene graph and a language scene graph. The two graphs are heterogeneous, representing the structured layouts of the image and the sentence, respectively. We learn unified contextual node representations for both graphs with a cross-modal graph convolutional network to reduce their discrepancy. Because the learned node representations encode both node information and structure information, the graph matching can be relaxed to a linear assignment problem. A permutation loss and a semantic cycle-consistency loss are further introduced to solve the linear assignment problem with or without ground-truth correspondences, respectively. Experimental results on two visual grounding tasks, i.e., referring expression comprehension and phrase localization, demonstrate the effectiveness of our method.
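The relaxation described in the abstract can be illustrated with a small sketch. The Python snippet below is a hypothetical illustration (not the authors' released code): it scores pairs of language-graph and visual-graph node embeddings with cosine similarity, applies a Sinkhorn-style normalization as a differentiable stand-in for training a permutation-style loss, and recovers hard correspondences with SciPy's linear_sum_assignment. The cross-modal graph convolutional network that would produce the contextual embeddings, and the function names sinkhorn and match_nodes, are assumptions for illustration only.

# Hypothetical sketch (not the authors' released code): relaxing scene-graph
# matching to a linear assignment over learned node embeddings.
import numpy as np
from scipy.optimize import linear_sum_assignment


def sinkhorn(scores, n_iters=20, tau=0.1):
    # Iteratively normalize rows and columns in the log domain to obtain a
    # soft assignment matrix; a differentiable surrogate for a hard permutation.
    log_p = scores / tau
    for _ in range(n_iters):
        log_p = log_p - np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        log_p = log_p - np.logaddexp.reduce(log_p, axis=0, keepdims=True)
    return np.exp(log_p)


def match_nodes(lang_nodes, vis_nodes):
    # lang_nodes: (m, d) embeddings of language scene-graph nodes (entities).
    # vis_nodes:  (n, d) embeddings of visual scene-graph nodes (objects).
    lang = lang_nodes / np.linalg.norm(lang_nodes, axis=1, keepdims=True)
    vis = vis_nodes / np.linalg.norm(vis_nodes, axis=1, keepdims=True)
    scores = lang @ vis.T                        # cosine similarity, (m, n)
    soft_assignment = sinkhorn(scores)           # soft (trainable) matching
    rows, cols = linear_sum_assignment(-scores)  # hard linear assignment
    return list(zip(rows.tolist(), cols.tolist())), soft_assignment


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    lang = rng.normal(size=(3, 8))   # e.g. 3 entities mentioned in the sentence
    vis = rng.normal(size=(5, 8))    # e.g. 5 detected objects in the image
    pairs, soft = match_nodes(lang, vis)
    print("entity -> object correspondences:", pairs)

In a training setting, the soft assignment would presumably be supervised against ground-truth correspondences (the permutation loss), while a semantic cycle-consistency objective covers the case where such correspondences are unavailable, as the abstract states.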
Pages: 4041-4050
Page count: 10
Related Papers
50 records in total; the first 10 are listed below.
  • [1] Visual-Semantic Graph Matching Net for Zero-Shot Learning
    Duan, Bowen
    Chen, Shiming
    Guo, Yufei
    Xie, Guo-Sen
    Ding, Weiping
    Wang, Yisong
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024,
  • [2] AUGMENTED VISUAL-SEMANTIC EMBEDDINGS FOR IMAGE AND SENTENCE MATCHING
    Chen, Zerui
    Huang, Yan
    Wang, Liang
    2019 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2019, : 290 - 294
  • [3] Visual-Semantic Graph Reasoning for Pedestrian Attribute Recognition
    Li, Qiaozhe
    Zhao, Xin
    He, Ran
    Huang, Kaiqi
    THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 8634 - 8641
  • [4] Visual-semantic network: a visual and semantic enhanced model for gesture recognition
    Wang, Yizhe
    Cao, Congqi
    Zhang, Yanning
    VISUAL INTELLIGENCE, 1 (1):
  • [5] Multilabel Deep Visual-Semantic Embedding
    Yeh, Mei-Chen
    Li, Yi-Nan
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2020, 42 (06) : 1530 - 1536
  • [6] Hierarchical Graph Attention Network for Few-shot Visual-Semantic Learning
    Yin, Chengxiang
    Wu, Kun
    Che, Zhengping
    Jiang, Bo
    Xu, Zhiyuan
    Tang, Jian
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 2157 - 2166
  • [7] Image Captioning with Visual-Semantic LSTM
    Li, Nannan
    Chen, Zhenzhong
    PROCEEDINGS OF THE TWENTY-SEVENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2018, : 793 - 799
  • [8] Learning Robust Visual-Semantic Embeddings
    Tsai, Yao-Hung Hubert
    Huang, Liang-Kang
    Salakhutdinov, Ruslan
    2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 3591 - 3600
  • [9] Visual Relationship Detection Using Joint Visual-Semantic Embedding
    Li, Binglin
    Wang, Yang
    2018 24TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2018, : 3291 - 3296
  • [10] Visual-semantic consistency matching network for generalized zero-shot learning
    Zhang, Zhenqi
    Cao, Wenming
    NEUROCOMPUTING, 2023, 536 : 30 - 39