Scene Text Visual Question Answering

Cited by: 145

Authors:

Biten, Ali Furkan [1]

Tito, Ruben [1]

Mafla, Andres [1]

Gomez, Lluis [1]

Rusinol, Marcal [1]

Valveny, Ernest [1]

Jawahar, C. V. [2]

Karatzas, Dimosthenis [1]

Affiliations:

[1] UAB, Comp Vis Ctr, Barcelona, Spain

[2] IIIT Hyderabad, CVIT, Hyderabad, India

Source:

2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019) | 2019

Keywords:

DOI:

10.1109/ICCV.2019.00439

Chinese Library Classification (CLC):

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Current visual question answering datasets do not consider the rich semantic information conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the Visual Question Answering process. We use this dataset to define a series of tasks of increasing difficulty for which reading the scene text in the context provided by the visual information is necessary to reason and generate an appropriate answer. We propose a new evaluation metric for these tasks to account for both reasoning errors and shortcomings of the text recognition module. In addition, we put forward a series of baseline methods, which provide further insight into the newly released dataset and set the scene for further research.
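
The abstract does not name the proposed metric. In the published paper it is the Average Normalized Levenshtein Similarity (ANLS), which gives partial credit to answers that are nearly right (for example, off by one misrecognized character) and zero credit once the normalized edit distance reaches a threshold of 0.5. The following is a minimal Python sketch of that idea, not the authors' evaluation code. The function names, the lower-casing and whitespace stripping, and the handling of empty inputs are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (two-row dynamic programme)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions, ground_truths, threshold=0.5):
    """Average over questions of the best thresholded similarity.

    predictions: one predicted answer string per question.
    ground_truths: one list of accepted answer strings per question.
    threshold: normalized distance at or above which the score is 0.
    """
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths differ in length")
    if not predictions:
        return 0.0  # assumption: an empty evaluation set scores 0
    total = 0.0
    for predicted, answers in zip(predictions, ground_truths):
        # Assumption: comparison is case-insensitive and ignores outer spaces.
        p = predicted.strip().lower()
        best = 0.0
        for answer in answers:
            t = answer.strip().lower()
            longest = max(len(p), len(t))
            distance = levenshtein(p, t) / longest if longest else 0.0
            similarity = 1.0 - distance if distance < threshold else 0.0
            best = max(best, similarity)
        total += best
    return total / len(predictions)
```

With this sketch, an exact match scores 1, a prediction such as "coca co1a" against the ground truth "coca cola" scores 8/9 (one substitution over nine characters), and an unrelated prediction such as "pepsi" scores 0. The threshold is what separates a likely recognition slip from a reasoning error.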

Citation

Pages: 4290-4300

Page count: 11

