Scene Text Visual Question Answering

Cited by: 145
Authors
Biten, Ali Furkan [1]
Tito, Ruben [1]
Mafla, Andres [1]
Gomez, Lluis [1]
Rusinol, Marcal [1]
Valveny, Ernest [1]
Jawahar, C. V. [2]
Karatzas, Dimosthenis [1]
Affiliations
[1] UAB, Comp Vis Ctr, Barcelona, Spain
[2] IIIT Hyderabad, CVIT, Hyderabad, India
DOI
10.1109/ICCV.2019.00439
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Current visual question answering datasets do not consider the rich semantic information conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the Visual Question Answering process. We use this dataset to define a series of tasks of increasing difficulty for which reading the scene text in the context provided by the visual information is necessary to reason and generate an appropriate answer. We propose a new evaluation metric for these tasks that accounts for both reasoning errors and shortcomings of the text recognition module. In addition, we put forward a series of baseline methods, which provide further insight into the newly released dataset and set the scene for further research.
Pages: 4290 - 4300
Page count: 11
Related papers
  • [1] A Multilingual Approach to Scene Text Visual Question Answering
    Brugues i Pujolras, Josep
    Gomez i Bigorda, Lluis
    Karatzas, Dimosthenis
    DOCUMENT ANALYSIS SYSTEMS, DAS 2022, 2022, 13237 : 65 - 79
  • [2] Scene text visual question answering by using YOLO and STN
    Nourali K.
    Dolkhani E.
    International Journal of Speech Technology, 2024, 27 (01) : 69 - 76
  • [3] Towards Reasoning Ability in Scene Text Visual Question Answering
    Wang, Qingqing
    Xiao, Liqiang
    Lu, Yue
    Jin, Yaohui
    He, Hao
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 2281 - 2289
  • [4] Improving visual question answering by combining scene-text information
    Sharma, Himanshu
    Jalal, Anand Singh
    MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 81 (09) : 12177 - 12208
  • [5] An Empirical Study of Multilingual Scene-Text Visual Question Answering
    Li, Lin
    Zhang, Haohan
    Fang, Zeqing
    PROCEEDINGS OF THE 2ND WORKSHOP ON USER-CENTRIC NARRATIVE SUMMARIZATION OF LONG VIDEOS, NARSUM 2023, 2023, : 3 - 8
  • [6] Improving visual question answering by combining scene-text information
    Himanshu Sharma
    Anand Singh Jalal
    Multimedia Tools and Applications, 2022, 81 : 12177 - 12208
  • [7] Transductive Cross-Lingual Scene-Text Visual Question Answering
    Li, Lin
    Zhang, Haohan
    Fang, Zeqin
    Xie, Zhongwei
    Liu, Jianquan
    NEURAL INFORMATION PROCESSING, ICONIP 2023, PT VI, 2024, 14452 : 452 - 467
  • [8] Multimodal grid features and cell pointers for scene text visual question answering
    Gomez, Lluis
    Biten, Ali Furkan
    Tito, Ruben
    Mafla, Andres
    Rusinol, Marcal
    Valveny, Ernest
    Karatzas, Dimosthenis
    PATTERN RECOGNITION LETTERS, 2021, 150 : 242 - 249
  • [9] A framework for visual question answering with the integration of scene-text using PHOCs and fisher vectors
    Sharma, Himanshu
    Jalal, Anand Singh
    EXPERT SYSTEMS WITH APPLICATIONS, 2022, 190
  • [10] Lightweight Visual Question Answering using Scene Graphs
    Nuthalapati, Sai Vidyaranya
    Chandradevan, Ramraj
    Giunchiglia, Eleonora
    Li, Bowen
    Kayser, Maxime
    Lukasiewicz, Thomas
    Yang, Carl
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT, CIKM 2021, 2021, : 3353 - 3357