Visual-Semantic Graph Matching for Visual Grounding

Cited by: 15
Authors
Jing, Chenchen [1]
Wu, Yuwei [1]
Pei, Mingtao [1]
Hu, Yao [2]
Jia, Yunde [1]
Wu, Qi [3]
Affiliations
[1] Beijing Institute of Technology, Beijing, People's Republic of China
[2] Alibaba Youku Cognitive and Intelligent Lab, Beijing, People's Republic of China
[3] University of Adelaide, Adelaide, SA, Australia
Keywords
Visual Grounding; Graph Matching; Visual Scene Graph; Language Scene Graph
DOI
10.1145/3394171.3413902
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Visual Grounding is the task of associating entities in a natural-language sentence with objects in an image. In this paper, we formulate visual grounding as a graph matching problem that finds node correspondences between a visual scene graph and a language scene graph. The two graphs are heterogeneous, representing the structural layouts of the image and the sentence, respectively. We learn unified contextual node representations for the two graphs with a cross-modal graph convolutional network, reducing the discrepancy between them. Because the learned node representations encode both node information and structure information, the graph matching relaxes to a linear assignment problem. A permutation loss and a semantic cycle-consistency loss are further introduced to solve the linear assignment problem with or without ground-truth correspondences. Experimental results on two visual grounding tasks, referring expression comprehension and phrase localization, demonstrate the effectiveness of our method.
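To make the relaxation concrete, below is a minimal sketch (not the authors' implementation) of the linear-assignment step: given unified node embeddings already produced by a cross-modal graph convolutional network, each language-graph node is matched to a visual-graph node by maximizing total similarity. The cosine-similarity scoring, embedding sizes, and function names are illustrative assumptions; the Hungarian solver is SciPy's linear_sum_assignment.

    # Hypothetical illustration, not the paper's code: once unified node
    # embeddings exist, grounding reduces to a linear assignment problem.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_nodes(lang_nodes, vis_nodes):
        """Match m language-graph nodes (m, d) to n visual-graph nodes (n, d).

        Returns (language_index, visual_index) pairs that maximize the
        total cosine similarity, i.e. the relaxed matching objective.
        """
        # L2-normalize so dot products are cosine similarities.
        lang = lang_nodes / np.linalg.norm(lang_nodes, axis=1, keepdims=True)
        vis = vis_nodes / np.linalg.norm(vis_nodes, axis=1, keepdims=True)
        sim = lang @ vis.T                        # (m, n) similarity matrix
        rows, cols = linear_sum_assignment(-sim)  # Hungarian solver; negate to maximize
        return list(zip(rows.tolist(), cols.tolist()))

    # Example: 3 sentence entities vs. 5 detected objects, 256-d embeddings.
    rng = np.random.default_rng(0)
    print(match_nodes(rng.normal(size=(3, 256)), rng.normal(size=(5, 256))))

A hard assignment like the one above corresponds to inference; per the abstract, training instead optimizes a permutation loss, with a semantic cycle-consistency loss covering the case where ground-truth correspondences are unavailable.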
Pages: 4041-4050
Page count: 10