RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering

Cited by: 20
Authors
Jin, Zan-Xia [1]
Wu, Heran [1]
Yang, Chun [1]
Zhou, Fang [1]
Qin, Jingyan [1, 2]
Xiao, Lei [3]
Yin, Xu-Cheng [1, 4, 5]
Affiliations
[1] Univ Sci & Technol Beijing, Sch Comp & Commun Engn, Dept Comp Sci & Technol, Beijing 100083, Peoples R China
[2] Univ Sci & Technol Beijing, Sch Mech Engn, Dept Ind Design, Beijing 100083, Peoples R China
[3] Tencent Technol Shenzhen Co Ltd, Shenzhen 518057, Peoples R China
[4] Univ Sci & Technol Beijing, Inst Artificial Intelligence, Beijing 100083, Peoples R China
[5] Univ Sci & Technol Beijing, USTB EEasy Tech Joint Lab Artificial Intelligence, Beijing 100083, Peoples R China
Keywords
Attention mechanism; computer vision; machine reading comprehension; natural language processing; visual question answering
DOI
10.1109/TMM.2021.3120194
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Text-based visual question answering (VQA) requires reading and understanding the text in an image to correctly answer a given question. However, most current methods simply add optical character recognition (OCR) tokens extracted from the image to the VQA model, without considering the contextual information of the OCR tokens or mining the relationships between OCR tokens and scene objects. In this paper, we propose a novel text-centered method called RUArt (Reading, Understanding and Answering the Related Text) for text-based VQA. Taking an image and a question as input, RUArt first reads the image to obtain text and scene objects. It then understands the question, the OCRed text, and the objects in the context of the scene, and further mines the relationships among them. Finally, it answers with the related text for the given question through text semantic matching and reasoning. We evaluate RUArt on two text-based VQA benchmarks (ST-VQA and TextVQA) and conduct extensive ablation studies to explore the reasons behind its effectiveness. Experimental results demonstrate that our method can effectively explore the contextual information of the text and mine the stable relationships between the text and objects.
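The read, understand, and answer stages described in the abstract can be illustrated with a toy sketch. This is not the authors' model: the names `OcrToken`, `read`, `understand`, and `answer` are hypothetical, the OCR tokens and their associated objects are assumed to be given rather than detected, and simple word overlap stands in for the learned semantic matching and reasoning.

```python
from dataclasses import dataclass


@dataclass
class OcrToken:
    text: str       # recognized text
    on_object: str  # label of the scene object the text appears on (assumed given)


def read(image: dict) -> list[OcrToken]:
    """'Read' stand-in: a real system would run OCR and object detection."""
    return [OcrToken(text, obj) for text, obj in image["ocr"]]


def understand(tokens: list[OcrToken]) -> list[set[str]]:
    """'Understand' stand-in: build a bag-of-words context for each token
    from its object label and its neighboring OCR tokens."""
    contexts = []
    for i, tok in enumerate(tokens):
        context = set(tok.on_object.lower().split())
        for j in (i - 1, i + 1):
            if 0 <= j < len(tokens):
                context.add(tokens[j].text.lower())
        contexts.append(context)
    return contexts


def answer(question: str, image: dict) -> str:
    """'Answer' stand-in: return the OCR token whose context shares the
    most words with the question (word overlap, not a learned matcher)."""
    tokens = read(image)
    if not tokens:
        return ""
    contexts = understand(tokens)
    q_words = set(question.lower().rstrip("?").split())
    scores = [len(q_words & ctx) for ctx in contexts]
    best = max(range(len(tokens)), key=scores.__getitem__)
    return tokens[best].text


# Hypothetical input: (OCR text, object it appears on) pairs.
image = {"ocr": [("STOP", "red sign"), ("Coca-Cola", "bottle")]}
print(answer("What does the red sign say?", image))  # STOP
```

The point of the sketch is only that the answer is chosen from the OCR tokens by using each token's surrounding text and associated object as context, rather than by treating the tokens as an unstructured list.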
Pages: 1-12
Page count: 12