Visual-Semantic Graph Matching for Visual Grounding

Cited by: 15
Authors
Jing, Chenchen [1 ]
Wu, Yuwei [1 ]
Pei, Mingtao [1 ]
Hu, Yao [2 ]
Jia, Yunde [1 ]
Wu, Qi [3 ]
Affiliations
[1] Beijing Inst Technol, Beijing, Peoples R China
[2] Alibaba Youku Cognit & Intelligent Lab, Beijing, Peoples R China
[3] Univ Adelaide, Adelaide, SA, Australia
Keywords
Visual Grounding; Graph Matching; Visual Scene Graph; Language Scene Graph; Language
DOI
10.1145/3394171.3413902
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual grounding is the task of associating entities in a natural language sentence with objects in an image. In this paper, we formulate visual grounding as a graph matching problem: finding node correspondences between a visual scene graph and a language scene graph. These two graphs are heterogeneous, representing the structural layouts of the image and the sentence, respectively. We learn unified contextual node representations for the two graphs with a cross-modal graph convolutional network to reduce their discrepancy. Because the learned node representations characterize both node information and structure information, the graph matching is relaxed to a linear assignment problem. A permutation loss and a semantic cycle-consistency loss are further introduced to solve the linear assignment problem with or without ground-truth correspondences. Experimental results on two visual grounding tasks, i.e., referring expression comprehension and phrase localization, demonstrate the effectiveness of our method.
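The relaxation described in the abstract — once node embeddings encode both content and structure, graph matching reduces to linear assignment — can be sketched as follows. This is not the authors' implementation; it is a minimal illustration using cosine similarity between hypothetical node embeddings and `scipy.optimize.linear_sum_assignment` (the Hungarian algorithm) to recover the correspondences:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_nodes(vis_emb, lang_emb):
    """Assign each language scene-graph node to a visual scene-graph node.

    vis_emb:  (N, d) array of contextual embeddings of visual-graph nodes
    lang_emb: (M, d) array of contextual embeddings of language-graph nodes
    Returns a dict mapping language-node index -> visual-node index.
    (Illustrative only; the embeddings here stand in for the output of
    the paper's cross-modal graph convolutional network.)
    """
    # L2-normalize so the dot product is cosine similarity.
    v = vis_emb / np.linalg.norm(vis_emb, axis=1, keepdims=True)
    l = lang_emb / np.linalg.norm(lang_emb, axis=1, keepdims=True)
    sim = l @ v.T  # (M, N) pairwise node similarities

    # Linear assignment maximizes total similarity; scipy minimizes cost,
    # so negate the similarity matrix.
    rows, cols = linear_sum_assignment(-sim)
    return dict(zip(rows.tolist(), cols.tolist()))
```

For example, with three orthogonal visual embeddings and language embeddings that are a permutation of them, the assignment recovers the permutation exactly; the paper's permutation and semantic cycle-consistency losses train the embeddings so that real data behaves this way.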
Pages: 4041 - 4050 (10 pages)
Related Papers
50 records total
  • [31] Multi-Label Learning With Visual-Semantic Embedded Knowledge Graph for Diagnosis of Radiology Imaging
    Hou, Daibing
    Zhao, Zijian
    Hu, Sanyuan
    IEEE ACCESS, 2021, 9 : 15720 - 15730
  • [32] Deep Visual-Semantic Quantization for Efficient Image Retrieval
    Cao, Yue
    Long, Mingsheng
    Wang, Jianmin
    Liu, Shichen
    30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 916 - 925
  • [33] Image Tagging by Joint Deep Visual-Semantic Propagation
    Ma, Yuexin
    Zhu, Xinge
    Sun, Yujing
    Yan, Bingzheng
    ADVANCES IN MULTIMEDIA INFORMATION PROCESSING - PCM 2017, PT I, 2018, 10735 : 25 - 35
  • [34] Learning and grounding visual multimodal adaptive graph for visual navigation
    Zhou, Kang
    Wang, Jianping
    Xu, Weitao
    Song, Linqi
    Ye, Zaiqiao
    Guo, Chi
    Li, Cong
    INFORMATION FUSION, 2025, 118
  • [35] Learning Hierarchical Visual-Semantic Representation with Phrase Alignment
    Yan, Baoming
    Zhang, Qingheng
    Chen, Liyu
    Wang, Lin
    Pei, Leihao
    Yang, Jiang
    Yu, Enyun
    Li, Xiaobo
    Zhao, Binqiang
    PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 349 - 357
  • [36] Deep Visual-Semantic Alignments for Generating Image Descriptions
    Karpathy, Andrej
    Fei-Fei, Li
    2015 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2015, : 3128 - 3137
  • [37] Hierarchical Multimodal LSTM for Dense Visual-Semantic Embedding
    Niu, Zhenxing
    Zhou, Mo
    Wang, Le
    Gao, Xinbo
    Hua, Gang
    2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 1899 - 1907
  • [38] Visual graph mining for graph matching
    Zhang, Quanshi
    Song, Xuan
    Yang, Yu
    Ma, Haotian
    Shibasaki, Ryosuke
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2019, 178 : 16 - 29
  • [39] Weakly-Supervised Image Hashing through Masked Visual-Semantic Graph-based Reasoning
    Jin, Lu
    Li, Zechao
    Pan, Yonghua
    Tang, Jinhui
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 916 - 924
  • [40] Webly Supervised Image Classification with Metadata: Automatic Noisy Label Correction via Visual-Semantic Graph
    Yang, Jingkang
    Chen, Weirong
    Feng, Litong
    Yan, Xiaopeng
    Zheng, Huabin
    Zhang, Wayne
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 83 - 91