Word2Pix: Word to Pixel Cross-Attention Transformer in Visual Grounding

Cited by: 10
Authors
Zhao, Heng [1 ]
Zhou, Joey Tianyi [1 ]
Ong, Yew-Soon [1 ,2 ]
Affiliations
[1] A*STAR Centre for Frontier AI Research (CFAR), Singapore 138632, Singapore
[2] Nanyang Technological University, School of Computer Science & Engineering, Singapore 639798, Singapore
Keywords
Cross-attention; deep learning; multimodal; referring expression comprehension; visual grounding
DOI
10.1109/TNNLS.2022.3183827
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Current one-stage methods for visual grounding encode the language query as one holistic sentence embedding before fusing it with visual features for target localization. Such a formulation provides insufficient ability to model the query at the word level, and is therefore prone to neglecting words that may not be the most important ones in the sentence but are critical for identifying the referred object. In this article, we propose Word2Pix: a one-stage visual grounding network based on the encoder-decoder transformer architecture that learns textual-to-visual feature correspondence via word-to-pixel attention. Each word in the query sentence is given an equal opportunity to attend to visual pixels through multiple stacked transformer decoder layers. In this way, the decoder can learn to model the language query and fuse it with the visual features for target prediction simultaneously. We conduct experiments on the RefCOCO, RefCOCO+, and RefCOCOg datasets, and the proposed Word2Pix outperforms existing one-stage methods by a notable margin. The results also show that Word2Pix surpasses two-stage visual grounding models while retaining the merits of the one-stage paradigm, namely end-to-end training and fast inference. Code is available at https://github.com/azurerain7/Word2Pix.
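To make the word-to-pixel cross-attention described in the abstract concrete, below is a minimal PyTorch sketch of the idea: word embeddings serve as decoder queries, flattened visual feature-map pixels serve as keys and values, so the decoder's self-attention models the language query while its cross-attention fuses words with visual pixels. The class name WordToPixelDecoder, the mean-pooled box head, and all dimensions are illustrative assumptions and do not reproduce the authors' released implementation.

import torch
import torch.nn as nn

class WordToPixelDecoder(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=6):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.bbox_head = nn.Linear(d_model, 4)  # regress one (cx, cy, w, h) box

    def forward(self, word_embs, visual_feats):
        # word_embs:    (B, num_words, d_model) word-level query embeddings
        # visual_feats: (B, d_model, H, W) feature map from a visual backbone
        B, C, H, W = visual_feats.shape
        pixels = visual_feats.flatten(2).transpose(1, 2)    # (B, H*W, d_model)
        # tgt = words (queries), memory = pixels (keys/values): every word
        # attends to every pixel location through the stacked decoder layers
        fused = self.decoder(tgt=word_embs, memory=pixels)  # (B, num_words, d_model)
        # pool the fused word features and predict the target box (an
        # illustrative choice, not necessarily the paper's prediction head)
        return self.bbox_head(fused.mean(dim=1))            # (B, 4)

# Usage with random tensors standing in for real language/vision encoders:
decoder = WordToPixelDecoder()
boxes = decoder(torch.randn(2, 12, 256), torch.randn(2, 256, 20, 20))
print(boxes.shape)  # torch.Size([2, 4])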
Pages: 1523-1533
Number of Pages: 11
Related Papers (50 records in total)
  • [1] Cross-Attention Transformer for Video Interpolation
    Kim, Hannah Halin
    Yu, Shuzhi
    Yuan, Shuai
    Tomasi, Carlo
    COMPUTER VISION - ACCV 2022 WORKSHOPS, 2023, 13848 : 325 - 342
  • [2] SCATT: Transformer tracking with symmetric cross-attention
    Zhang, Jianming
    Chen, Wentao
    Dai, Jiangxin
    Zhang, Jin
    APPLIED INTELLIGENCE, 2024, 54 (08) : 6069 - 6084
  • [3] Cross-attention Based Text-image Transformer for Visual Question Answering
    Rezapour, M.
    RECENT ADVANCES IN COMPUTER SCIENCE AND COMMUNICATIONS, 2024, 17 (04) : 72 - 78
  • [4] Deblurring transformer tracking with conditional cross-attention
    Sun, Fuming
    Zhao, Tingting
    Zhu, Bing
    Jia, Xu
    Wang, Fasheng
    MULTIMEDIA SYSTEMS, 2023, 29 (03) : 1131 - 1144
  • [5] Word recognition and visual attention
    Vitu, F
    Schroyens, W
    Brysbaert, M
    INTERNATIONAL JOURNAL OF PSYCHOLOGY, 1996, 31 (3-4) : 18449 - 18449
  • [6] Deformable Cross-Attention Transformer for Medical Image Registration
    Chen, Junyu
    Liu, Yihao
    He, Yufan
    Du, Yong
    MACHINE LEARNING IN MEDICAL IMAGING, MLMI 2023, PT I, 2024, 14348 : 115 - 125
  • [7] CAR-Transformer: Cross-Attention Reinforcement Transformer for Cross-Lingual Summarization
    Cai, Yuang
    Yuan, Yuyu
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 16, 2024 : 17718 - 17726
  • [8] Unsupervised Domain Adaptation via Bidirectional Cross-Attention Transformer
    Wang, Xiyu
    Guo, Pengxin
    Zhang, Yu
    MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES: RESEARCH TRACK, ECML PKDD 2023, PT V, 2023, 14173 : 309 - 325
  • [9] Learning Cross-Attention Point Transformer With Global Porous Sampling
    Duan, Yueqi
    Sun, Haowen
    Yan, Juncheng
    Lu, Jiwen
    Zhou, Jie
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 6283 - 6297