Exploring Better Text Image Translation with Multimodal Codebook

Cited by: 0

Authors:

Lan, Zhibin [1,3]

Yu, Jiawei [1,3]

Li, Xiang [2]

Zhang, Wen [2]

Luan, Jian [2]

Wang, Bin [2]

Huang, Degen [4]

Su, Jinsong [1,3]

Affiliations:

[1] Xiamen Univ, Sch Informat, Xiamen, Peoples R China

[2] Xiaomi AI Lab, Beijing, Peoples R China

[3] Xiamen Univ, Key Lab Digital Protect & Intelligent Proc Intang, Minist Culture & Tourism, Xiamen, Peoples R China

[4] Dalian Univ Technol, Dalian, Peoples R China

Source:

PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1 | 2023

Funding:

National Natural Science Foundation of China

Keywords:

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Text image translation (TIT) aims to translate the source texts embedded in an image into target translations; it has a wide range of applications and thus important research value. However, current studies on TIT are confronted with two main bottlenecks: (1) the task lacks a publicly available TIT dataset, and (2) dominant models are constructed in a cascaded manner, which tends to suffer from the error propagation of optical character recognition (OCR). In this work, we first annotate a Chinese-English TIT dataset named OCRMT30K, providing convenience for subsequent studies. Then, we propose a TIT model with a multimodal codebook, which is able to associate the image with relevant texts, providing useful supplementary information for translation. Moreover, we present a multi-stage training framework involving text machine translation, image-text alignment, and TIT tasks, which fully exploits additional bilingual texts, an OCR dataset, and our OCRMT30K dataset to train our model. Extensive experiments and in-depth analyses strongly demonstrate the effectiveness of our proposed model and training framework.


Pages: 3479-3491

Page count: 13

