Scene Text Visual Question Answering

Cited by: 145

Authors:

Biten, Ali Furkan [1]

Tito, Ruben [1]

Mafla, Andres [1]

Gomez, Lluis [1]

Rusinol, Marcal [1]

Valveny, Ernest [1]

Jawahar, C. V. [2]

Karatzas, Dimosthenis [1]

Affiliations:

[1] UAB, Comp Vis Ctr, Barcelona, Spain

[2] IIIT Hyderabad, CVIT, Hyderabad, India

Source:

2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019) | 2019

Keywords:

DOI:

10.1109/ICCV.2019.00439

Chinese Library Classification:

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Current visual question answering datasets do not consider the rich semantic information conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the Visual Question Answering process. We use this dataset to define a series of tasks of increasing difficulty for which reading the scene text in the context provided by the visual information is necessary to reason and generate an appropriate answer. We propose a new evaluation metric for these tasks to account for both reasoning errors and shortcomings of the text recognition module. In addition, we put forward a series of baseline methods, which provide further insight into the newly released dataset and set the scene for further research.


Pages: 4290-4300

Number of pages: 11

