Exploring Better Text Image Translation with Multimodal Codebook

Cited by: 0
Authors
Lan, Zhibin [1, 3]
Yu, Jiawei [1, 3]
Li, Xiang [2 ]
Zhang, Wen [2 ]
Luan, Jian [2 ]
Wang, Bin [2 ]
Huang, Degen [4 ]
Su, Jinsong [1, 3]
Affiliations
[1] Xiamen Univ, Sch Informat, Xiamen, Peoples R China
[2] Xiaomi AI Lab, Beijing, Peoples R China
[3] Xiamen Univ, Key Lab Digital Protect & Intelligent Proc Intang, Minist Culture & Tourism, Xiamen, Peoples R China
[4] Dalian Univ Technol, Dalian, Peoples R China
Source
PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1 | 2023
Funding
National Natural Science Foundation of China
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Text image translation (TIT) aims to translate the source texts embedded in the image to target translations, which has a wide range of applications and thus has important research value. However, current studies on TIT are confronted with two main bottlenecks: 1) this task lacks a publicly available TIT dataset, 2) dominant models are constructed in a cascaded manner, which tends to suffer from the error propagation of optical character recognition (OCR). In this work, we first annotate a Chinese-English TIT dataset named OCRMT30K, providing convenience for subsequent studies. Then, we propose a TIT model with a multimodal codebook, which is able to associate the image with relevant texts, providing useful supplementary information for translation. Moreover, we present a multi-stage training framework involving text machine translation, image-text alignment, and TIT tasks, which fully exploits additional bilingual texts, OCR dataset and our OCRMT30K dataset to train our model. Extensive experiments and in-depth analyses strongly demonstrate the effectiveness of our proposed model and training framework.
Pages: 3479-3491
Page count: 13