RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering

Cited by: 20
Authors
Jin, Zan-Xia [1]
Wu, Heran [1]
Yang, Chun [1]
Zhou, Fang [1]
Qin, Jingyan [1, 2]
Xiao, Lei [3]
Yin, Xu-Cheng [1, 4, 5]
Affiliations
[1] Univ Sci & Technol Beijing, Sch Comp & Commun Engn, Dept Comp Sci & Technol, Beijing 100083, Peoples R China
[2] Univ Sci & Technol Beijing, Sch Mech Engn, Dept Ind Design, Beijing 100083, Peoples R China
[3] Tencent Technol Shenzhen Co Ltd, Shenzhen 518057, Peoples R China
[4] Univ Sci & Technol Beijing, Inst Artificial Intelligence, Beijing 100083, Peoples R China
[5] Univ Sci & Technol Beijing, USTB EEasy Tech Joint Lab Artificial Intelligence, Beijing 100083, Peoples R China
Keywords
Attention mechanism; computer vision; machine reading comprehension; natural language processing; visual question answering;
DOI
10.1109/TMM.2021.3120194
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Text-based visual question answering (VQA) requires reading and understanding the text in an image in order to answer a given question correctly. However, most current methods simply add optical character recognition (OCR) tokens extracted from the image to the VQA model, without considering the contextual information of the OCR tokens or mining the relationships between OCR tokens and scene objects. In this paper, we propose a novel text-centered method called RUArt (Reading, Understanding and Answering the Related Text) for text-based VQA. Taking an image and a question as input, RUArt first reads the image and obtains the text and scene objects. Then, it understands the question, the OCRed text, and the objects in the context of the scene, and further mines the relationships among them. Finally, it answers the related text for the given question through text semantic matching and reasoning. We evaluate RUArt on two text-based VQA benchmarks (ST-VQA and TextVQA) and conduct extensive ablation studies to explore the reasons behind its effectiveness. Experimental results demonstrate that our method can effectively explore the contextual information of the text and mine the stable relationships between the text and objects.
Pages: 1 / 12
Page count: 12