RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering

Cited by: 20
Authors
Jin, Zan-Xia [1]
Wu, Heran [1]
Yang, Chun [1]
Zhou, Fang [1]
Qin, Jingyan [1, 2]
Xiao, Lei [3]
Yin, Xu-Cheng [1, 4, 5]
Affiliations
[1] Univ Sci & Technol Beijing, Sch Comp & Commun Engn, Dept Comp Sci & Technol, Beijing 100083, Peoples R China
[2] Univ Sci & Technol Beijing, Sch Mech Engn, Dept Ind Design, Beijing 100083, Peoples R China
[3] Tencent Technol Shenzhen Co Ltd, Shenzhen 518057, Peoples R China
[4] Univ Sci & Technol Beijing, Inst Artificial Intelligence, Beijing 100083, Peoples R China
[5] Univ Sci & Technol Beijing, USTB EEasy Tech Joint Lab Artificial Intelligence, Beijing 100083, Peoples R China
Keywords
Attention mechanism; computer vision; machine reading comprehension; natural language processing; visual question answering
DOI
10.1109/TMM.2021.3120194
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Text-based visual question answering (VQA) requires reading and understanding the text in an image to correctly answer a given question. However, most current methods simply add optical character recognition (OCR) tokens extracted from the image to the VQA model, without considering the contextual information of the OCR tokens or mining the relationships between OCR tokens and scene objects. In this paper, we propose a novel text-centered method called RUArt (Reading, Understanding and Answering the Related Text) for text-based VQA. Taking an image and a question as input, RUArt first reads the image and obtains the text and scene objects. Then, it understands the question, the OCRed text, and the objects in the context of the scene, and further mines the relationships among them. Finally, it answers the related text for the given question through text semantic matching and reasoning. We evaluate RUArt on two text-based VQA benchmarks (ST-VQA and TextVQA) and conduct extensive ablation studies to explore the reasons behind its effectiveness. Experimental results demonstrate that our method can effectively explore the contextual information of the text and mine the stable relationships between the text and objects.
Pages: 1 / 12
Page count: 12