RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering

Cited by: 20
Authors
Jin, Zan-Xia [1]
Wu, Heran [1]
Yang, Chun [1]
Zhou, Fang [1]
Qin, Jingyan [1, 2]
Xiao, Lei [3]
Yin, Xu-Cheng [1, 4, 5]
Affiliations
[1] Univ Sci & Technol Beijing, Sch Comp & Commun Engn, Dept Comp Sci & Technol, Beijing 100083, Peoples R China
[2] Univ Sci & Technol Beijing, Sch Mech Engn, Dept Ind Design, Beijing 100083, Peoples R China
[3] Tencent Technol Shenzhen Co Ltd, Shenzhen 518057, Peoples R China
[4] Univ Sci & Technol Beijing, Inst Artificial Intelligence, Beijing 100083, Peoples R China
[5] Univ Sci & Technol Beijing, USTB EEasy Tech Joint Lab Artificial Intelligence, Beijing 100083, Peoples R China
Keywords
Attention mechanism; computer vision; machine reading comprehension; natural language processing; visual question answering;
DOI
10.1109/TMM.2021.3120194
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Text-based visual question answering (VQA) requires reading and understanding the text in an image to correctly answer a given question. However, most current methods simply add optical character recognition (OCR) tokens extracted from the image to the VQA model, without considering the contextual information of the OCR tokens or mining the relationships between OCR tokens and scene objects. In this paper, we propose a novel text-centered method called RUArt (Reading, Understanding and Answering the Related Text) for text-based VQA. Taking an image and a question as input, RUArt first reads the image and obtains text and scene objects. Then, it understands the question, the OCRed text, and the objects in the context of the scene, and further mines the relationships among them. Finally, it answers the related text for the given question through text semantic matching and reasoning. We evaluate RUArt on two text-based VQA benchmarks (ST-VQA and TextVQA) and conduct extensive ablation studies to explore the reasons behind its effectiveness. Experimental results demonstrate that our method can effectively explore the contextual information of the text and mine the stable relationships between the text and objects.
Pages: 1 / 12
Page count: 12
Related papers
50 in total
  • [31] VISION AND TEXT TRANSFORMER FOR PREDICTING ANSWERABILITY ON VISUAL QUESTION ANSWERING
    Le, Tung
    Huy Tien Nguyen
    Minh Le Nguyen
    2021 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2021, : 934 - 938
  • [32] Towards Video Text Visual Question Answering: Benchmark and Baseline
    Zhao, Minyi
    Li, Bingjia
    Wang, Jie
    Li, Wanqing
    Zhou, Wenjing
    Zhang, Lan
    Xuyang, Shijie
    Yu, Zhihang
    Yu, Xinkun
    Li, Guangze
    Dai, Aobotao
    Zhou, Shuigeng
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [33] RUSSIAN HISTORICAL LEXICOGRAPHY: WORD-CENTERED AND TEXT-CENTERED APPROACHES
    Generalova, Elena V.
    VOPROSY LEKSIKOGRAFII-RUSSIAN JOURNAL OF LEXICOGRAPHY, 2018, 13 : 7 - 22
  • [34] Combining Text Classification and Text Matching for FAQ-Based Question Answering
    Mo Q.
    Wang X.-J.
    Beijing Youdian Daxue Xuebao/Journal of Beijing University of Posts and Telecommunications, 2019, 42 (04): : 76 - 81
  • [35] Image Captioning with Text-Based Visual Attention
    Chen He
    Haifeng Hu
    Neural Processing Letters, 2019, 49 : 177 - 185
  • [36] Image Captioning with Text-Based Visual Attention
    He, Chen
    Hu, Haifeng
    NEURAL PROCESSING LETTERS, 2019, 49 (01) : 177 - 185
  • [37] Cross-attention Based Text-image Transformer for Visual Question Answering
    Rezapour M.
    Recent Advances in Computer Science and Communications, 2024, 17 (04) : 72 - 78
  • [38] Text-based neural networks for question intent recognition
    Trewhela, Alvaro
    Figueroa, Alejandro
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2023, 121
  • [39] ViOCRVQA: novel benchmark dataset and VisionReader for visual question answering by understanding Vietnamese text in images
    Pham, Huy Quang
    Nguyen, Thang Kien-Bao
    Nguyen, Quan Van
    Tran, Dan Quang
    Nguyen, Nghia Hieu
    Nguyen, Kiet Van
    Nguyen, Ngan Luu-Thuy
    MULTIMEDIA SYSTEMS, 2025, 31 (02)
  • [40] The Effects of a Text-Centered Literacy Curriculum for Students With Intellectual Disability
    Allor, Jill H.
    Gifford, Diane B.
    Jones, Francesca G.
    Al Otaiba, Stephanie
    Yovanoff, Paul
    Ortiz, Miriam B.
    Cheatham, Jennifer P.
    AJIDD-AMERICAN JOURNAL ON INTELLECTUAL AND DEVELOPMENTAL DISABILITIES, 2018, 123 (05): : 474 - 494