Scene Text Visual Question Answering

Cited by: 145
Authors
Biten, Ali Furkan [1]
Tito, Ruben [1]
Mafla, Andres [1]
Gomez, Lluis [1]
Rusinol, Marcal [1]
Valveny, Ernest [1]
Jawahar, C. V. [2]
Karatzas, Dimosthenis [1]
Affiliations
[1] UAB, Comp Vis Ctr, Barcelona, Spain
[2] IIIT Hyderabad, CVIT, Hyderabad, India
Keywords
DOI
10.1109/ICCV.2019.00439
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Current visual question answering datasets do not consider the rich semantic information conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the Visual Question Answering process. We use this dataset to define a series of tasks of increasing difficulty for which reading the scene text in the context provided by the visual information is necessary to reason and generate an appropriate answer. We propose a new evaluation metric for these tasks that accounts for both reasoning errors and shortcomings of the text recognition module. In addition, we put forward a series of baseline methods, which provide further insight into the newly released dataset and set the scene for further research.
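The evaluation metric the abstract refers to is the Average Normalized Levenshtein Similarity (ANLS), which grants partial credit when a predicted answer is a close but imperfect transcription of a ground-truth answer. The Python sketch below is a minimal illustration of this kind of soft scoring, not the official evaluation code; the 0.5 rejection threshold and case-insensitive comparison are assumptions consistent with the ST-VQA protocol, and the function names and input format are hypothetical.

    # Minimal ANLS-style scoring sketch (hypothetical helper names; tau = 0.5
    # and lower-casing are assumptions consistent with the ST-VQA protocol).

    def normalized_levenshtein(s1: str, s2: str) -> float:
        # Edit distance between s1 and s2, normalized by the longer length.
        if not s1 and not s2:
            return 0.0
        prev = list(range(len(s2) + 1))
        for i, c1 in enumerate(s1, start=1):
            curr = [i]
            for j, c2 in enumerate(s2, start=1):
                cost = 0 if c1 == c2 else 1
                curr.append(min(prev[j] + 1,         # deletion
                                curr[j - 1] + 1,     # insertion
                                prev[j - 1] + cost)) # substitution
            prev = curr
        return prev[-1] / max(len(s1), len(s2))

    def anls(predictions, ground_truths, tau: float = 0.5) -> float:
        # predictions: one answer string per question.
        # ground_truths: a list of acceptable answer strings per question.
        # Each prediction scores 1 - NL against its closest ground-truth answer,
        # truncated to 0 when NL >= tau (the answer/OCR output is too far off).
        scores = []
        for pred, answers in zip(predictions, ground_truths):
            best = 0.0
            for ans in answers:
                nl = normalized_levenshtein(pred.strip().lower(), ans.strip().lower())
                best = max(best, 1.0 - nl if nl < tau else 0.0)
            scores.append(best)
        return sum(scores) / len(scores) if scores else 0.0

    # A near-miss reading still earns partial credit; a wrong answer earns none.
    print(anls(["cocacola", "pepsi"], [["coca cola", "coca-cola"], ["coca cola"]]))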
Pages: 4290 - 4300
Number of Pages: 11
Related Papers
50 records in total
  • [31] Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation
    Yan, Xu
    Yuan, Zhihao
    Du, Yuhao
    Liao, Yinghong
    Guo, Yao
    Cui, Shuguang
    Li, Zhen
    IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, 2024, 30 (12) : 7473 - 7485
  • [32] Knowledge enhancement and scene understanding for knowledge-based visual question answering
    Su, Zhenqiang
    Gou, Gang
    KNOWLEDGE AND INFORMATION SYSTEMS, 2024, 66 : 2193 - 2208
  • [33] OECA-Net: A co-attention network for visual question answering based on OCR scene text feature enhancement
    Yan, Feng
    Silamu, Wushouer
    Chai, Yachuang
    Li, Yanbing
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (03) : 7085 - 7096
  • [35] Question Modifiers in Visual Question Answering
    Britton, William
    Sarkhel, Somdeb
    Venugopal, Deepak
    LREC 2022: THIRTEEN INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 1472 - 1479
  • [36] RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering
    Jin, Zan-Xia
    Wu, Heran
    Yang, Chun
    Zhou, Fang
    Qin, Jingyan
    Xiao, Lei
    Yin, Xu-Cheng
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 1 - 12
  • [37] Question-aware dynamic scene graph of local semantic representation learning for visual question answering
    Wu, Jinmeng
    Ge, Fulin
    Hong, Hanyu
    Shi, Yu
    Hao, Yanbin
    Ma, Lei
    PATTERN RECOGNITION LETTERS, 2023, 170 : 93 - 99
  • [38] Answer-Based Entity Extraction and Alignment for Visual Text Question Answering
    Yu, Jun
    Jing, Mohan
    Liu, Weihao
    Luo, Tongxu
    Zhang, Bingyuan
    Lu, Keda
    Lei, Fangyu
    Sun, Jianqing
    Liang, Jiaen
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 9487 - 9491
  • [39] VTQAGen: BART-based Generative Model For Visual Text Question Answering
    Chen, Haoru
    Wan, Tianjiao
    Lin, Zhimin
    Xu, Kele
    Wang, Jin
    Wang, Huaimin
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 9456 - 9461
  • [40] Visual question answering based evaluation metrics for text-to-image generation
    Miyamoto, Mizuki
    Morita, Ryugo
    Zhou, Jinjia
    2024 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS, ISCAS 2024, 2024,