Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Cited by: 1040
Authors
Anderson, Peter [1 ,6 ]
He, Xiaodong [2 ]
Buehler, Chris [3 ]
Teney, Damien [4 ]
Johnson, Mark [5 ]
Gould, Stephen [1 ]
Zhang, Lei [3 ]
Affiliations
[1] Australian Natl Univ, Canberra, ACT, Australia
[2] JD AI Res, Beijing, Peoples R China
[3] Microsoft Res, Redmond, WA USA
[4] Univ Adelaide, Adelaide, SA, Australia
[5] Macquarie Univ, N Ryde, NSW, Australia
[6] Microsoft, Redmond, WA 98052 USA
Source
2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2018
Funding
Australian Research Council
Keywords
OBJECTS;
DOI
10.1109/CVPR.2018.00636
Chinese Library Classification (CLC): TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.
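The mechanism the abstract describes can be sketched as follows: the bottom-up stage yields k region feature vectors, and the top-down stage scores each region against a task-context vector (e.g. an LSTM hidden state), normalizes the scores with a softmax, and returns a weighted sum of the region features. A minimal NumPy sketch, with randomly initialized weight matrices standing in for learned parameters (the function name and parameter names are illustrative, not from the paper's released code):

```python
import numpy as np

def top_down_attention(region_feats, query, W_v, W_q, w_a):
    """Soft top-down attention over bottom-up region features.

    region_feats: (k, d_v) array, one feature vector per detected region
    query:        (d_q,) task-context vector (e.g. an LSTM hidden state)
    W_v, W_q, w_a: learned projection parameters (random here, for the sketch)
    Returns the attended feature (d_v,) and the per-region weights (k,).
    """
    # Unnormalized attention logit per region: w_a^T tanh(W_v v_i + W_q q)
    logits = np.tanh(region_feats @ W_v + query @ W_q) @ w_a
    # Numerically stable softmax -> one non-negative weight per region
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Attended feature: convex combination of the region features
    return weights @ region_feats, weights

# Illustrative usage with random features and parameters
rng = np.random.default_rng(0)
k, d_v, d_q, d_a = 5, 8, 6, 4          # regions, feature/query/attention dims
V = rng.standard_normal((k, d_v))       # bottom-up region features
q = rng.standard_normal(d_q)            # top-down context
W_v = rng.standard_normal((d_v, d_a))
W_q = rng.standard_normal((d_q, d_a))
w_a = rng.standard_normal(d_a)
attended, weights = top_down_attention(V, q, W_v, W_q, w_a)
```

The weights form a probability distribution over regions, so the attended feature always lies in the convex hull of the region features; this region-level (rather than grid-level) pooling is the core idea the abstract claims.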
Pages: 6077-6086
Page count: 10