Visual-Semantic Graph Matching for Visual Grounding

Cited by: 15
Authors
Jing, Chenchen [1 ]
Wu, Yuwei [1 ]
Pei, Mingtao [1 ]
Hu, Yao [2 ]
Jia, Yunde [1 ]
Wu, Qi [3 ]
Affiliations
[1] Beijing Institute of Technology, Beijing, China
[2] Alibaba Youku Cognitive and Intelligent Lab, Beijing, China
[3] University of Adelaide, Adelaide, SA, Australia
Keywords
Visual Grounding; Graph Matching; Visual Scene Graph; Language Scene Graph; Language
DOI
10.1145/3394171.3413902
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Visual grounding is the task of associating entities in a natural language sentence with objects in an image. In this paper, we formulate visual grounding as a graph matching problem: finding node correspondences between a visual scene graph and a language scene graph. These two graphs are heterogeneous, representing the structural layouts of the image and the sentence, respectively. We learn unified contextual node representations for the two graphs using a cross-modal graph convolutional network to reduce their discrepancy. Because the learned node representations encode both node information and structure information, the graph matching relaxes to a linear assignment problem. A permutation loss and a semantic cycle-consistency loss are further introduced to solve the linear assignment problem with or without ground-truth correspondences. Experimental results on two visual grounding tasks, i.e., referring expression comprehension and phrase localization, demonstrate the effectiveness of our method.
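The final matching step described in the abstract can be illustrated with a minimal sketch (not the authors' implementation): assuming a cross-modal GCN has already embedded both scene graphs into a shared space, grounding reduces to a linear assignment over pairwise node similarities, solvable with the Hungarian algorithm. The function name and toy data below are hypothetical.

```python
# Minimal sketch of relaxing graph matching to linear assignment:
# once both scene graphs live in a unified embedding space, each
# language node is assigned to its best-matching visual node.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_nodes(lang_emb, vis_emb):
    """Match language scene-graph nodes to visual scene-graph nodes.

    lang_emb: (n_lang, d) unified embeddings of language-graph nodes
    vis_emb:  (n_vis, d)  unified embeddings of visual-graph nodes
    Returns a list of (lang_idx, vis_idx) correspondences.
    """
    # Cosine similarity between every language node and every visual node.
    ln = lang_emb / np.linalg.norm(lang_emb, axis=1, keepdims=True)
    vn = vis_emb / np.linalg.norm(vis_emb, axis=1, keepdims=True)
    sim = ln @ vn.T
    # Linear assignment: maximize total similarity (minimize its negation).
    rows, cols = linear_sum_assignment(-sim)
    return list(zip(rows.tolist(), cols.tolist()))

# Toy example: 2 language nodes, 3 candidate objects.
lang = np.array([[1.0, 0.0], [0.0, 1.0]])
vis = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])
print(match_nodes(lang, vis))  # → [(0, 0), (1, 1)]
```

During training, the paper replaces this hard assignment with differentiable losses (a permutation loss and a semantic cycle-consistency loss); the hard assignment above corresponds to inference time.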
Pages: 4041 - 4050
Number of Pages: 10
Related Papers
50 records in total
  • [21] From Node to Graph: Joint Reasoning on Visual-Semantic Relational Graph for Zero-Shot Detection
    Nie, Hui
    Wang, Ruiping
    Chen, Xilin
    2022 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2022), 2022, : 1648 - 1657
  • [22] Human-object interaction detection via interactive visual-semantic graph learning
    Wu, Tongtong
    Duan, Fuqing
    Chang, Liang
    Lu, Ke
    SCIENCE CHINA-INFORMATION SCIENCES, 2022, 65 (06)
  • [23] VSRN: Visual-Semantic Relation Network for Video Visual Relation Inference
    Cao, Qianwen
    Huang, Heyan
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (02) : 768 - 777
  • [24] Human-object interaction detection via interactive visual-semantic graph learning
    Wu, Tongtong
    Duan, Fuqing
    Chang, Liang
    Lu, Ke
    SCIENCE CHINA-INFORMATION SCIENCES, 2022, 65 (06) : 81 - 82
  • [25] Human-object interaction detection via interactive visual-semantic graph learning
    Wu, Tongtong
    Duan, Fuqing
    Chang, Liang
    Lu, Ke
    SCIENCE CHINA-INFORMATION SCIENCES, 2022, 65
  • [26] Artpedia: A New Visual-Semantic Dataset with Visual and Contextual Sentences in the Artistic Domain
    Stefanini, Matteo
    Cornia, Marcella
    Baraldi, Lorenzo
    Corsini, Massimiliano
    Cucchiara, Rita
    IMAGE ANALYSIS AND PROCESSING - ICIAP 2019, PT II, 2019, 11752 : 729 - 740
  • [27] Graph-Based Visual-Semantic Entanglement Network for Zero-Shot Image Recognition
    Hu, Yang
    Wen, Guihua
    Chapman, Adriane
    Yang, Pei
    Luo, Mingnan
    Xu, Yingxue
    Dai, Dan
    Hall, Wendy
    IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24 : 2473 - 2487
  • [28] Recipe Popularity Prediction with Deep Visual-Semantic Fusion
    Sanjo, Satoshi
    Katsurai, Marie
    CIKM'17: PROCEEDINGS OF THE 2017 ACM CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, 2017, : 2279 - 2282
  • [29] Deep Visual-Semantic Alignments for Generating Image Descriptions
    Karpathy, Andrej
    Fei-Fei, Li
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2017, 39 (04) : 664 - 676
  • [30] Hierarchical visual-semantic interaction for scene text recognition
    Diao, Liang
    Tang, Xin
    Wang, Jun
    Xie, Guotong
    Hu, Junlin
    INFORMATION FUSION, 2024, 102