Improving Parallel Corpus Quality for Chinese-Vietnamese Statistical Machine Translation

Authors
Tran H.-A. [1]
Guo Y. [1]
Jian P. [1]
Shi S. [2]
Huang H. [1, 2]
Affiliations
[1] Department of Computer Science and Technology, Beijing Institute of Technology, Beijing
[2] Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Application, Beijing Institute of Technology, Beijing
Funding
National Natural Science Foundation of China
Keywords
Bilingual movie subtitles; Chinese-Vietnamese translation; Low resource languages; Machine translation; Parallel corpus filtering
DOI
10.15918/j.jbit1004-0579.201827.0116
Abstract
The performance of a machine translation system depends heavily on the quantity and quality of its bilingual language resources. However, obtaining a large, high-quality parallel corpus is very difficult, especially for low-resource language pairs such as Chinese-Vietnamese. Fortunately, multilingual user-generated content (UGC), such as bilingual movie subtitles, makes it possible to construct parallel corpora automatically. Although UGC parallel corpora can be considerable in size, the raw corpus is not suitable for statistical machine translation (SMT) systems, because it may contain translation errors, mismatched sentences, free translations, and similar noise. To improve the quality of the bilingual corpus for SMT systems, three filtering methods are proposed, based on sentence length difference, the semantics of sentence pairs, and machine learning. Experiments are conducted on a Chinese-to-Vietnamese translation corpus. The results demonstrate that all three methods effectively improve corpus quality, and that machine translation performance can be improved by 1.32 BLEU points. © 2018 Editorial Department of Journal of Beijing Institute of Technology.
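As a rough illustration of the first filtering idea (sentence length difference), the sketch below drops sentence pairs whose token counts differ too much. The abstract does not give the paper's actual criterion, so the ratio-based rule, the `min_ratio=0.5` threshold, the function names `length_ratio` and `filter_by_length`, and the assumption of pre-tokenized input are all hypothetical choices made for this example.

```python
def length_ratio(src_tokens, tgt_tokens):
    """Return shorter length / longer length; 0.0 if either side is empty."""
    a, b = len(src_tokens), len(tgt_tokens)
    if a == 0 or b == 0:
        return 0.0
    return min(a, b) / max(a, b)


def filter_by_length(pairs, min_ratio=0.5):
    """Keep only pairs whose length ratio is at least min_ratio.

    min_ratio is an assumed threshold, not a value taken from the paper.
    """
    return [(s, t) for s, t in pairs if length_ratio(s, t) >= min_ratio]


# Toy subtitle-style pairs (already tokenized); the second pair is badly mismatched.
pairs = [
    (["我", "爱", "你"], ["Anh", "yêu", "em"]),
    (["好"], ["Tôi", "không", "biết", "anh", "ấy", "ở", "đâu"]),
]
kept = filter_by_length(pairs)
```

In practice the threshold would likely need tuning per language pair, since Chinese word segmentation and Vietnamese syllable-based tokenization yield systematically different token counts for equivalent sentences.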
Pages: 127-136
Page count: 9
Related papers
50 in total
  • [1] Improving Parallel Corpus Quality for Chinese-Vietnamese Statistical Machine Translation
    Huu-anh Tran
    Yuhang Guo
    Ping Jian
    Shumin Shi
    Heyan Huang
    Journal of Beijing Institute of Technology, 2018, 27 (01) : 127 - 136
  • [2] Preordering for Chinese-Vietnamese Statistical Machine Translation
    Huu-Anh Tran
    Huang, Heyan
    Phuoc Tran
    Shi, Shumin
    Huu Nguyen
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2019, E102D (02): : 375 - 382
  • [3] Integrating Pronunciation into Chinese-Vietnamese Statistical Machine Translation
    Anh Tran Huu
    Huang, Heyan
    Guo, Yuhang
    Shi, Shumin
    Jian, Ping
    TSINGHUA SCIENCE AND TECHNOLOGY, 2018, 23 (06) : 715 - 723
  • [4] Integrating Pronunciation into Chinese-Vietnamese Statistical Machine Translation
    Anh Tran Huu
    Heyan Huang
    Yuhang Guo
    Shumin Shi
    Ping Jian
    Tsinghua Science and Technology, 2018, 23 (06) : 715 - 723
  • [5] A Method of Chinese-Vietnamese Bilingual Corpus Construction for Machine Translation
    Tran, Phuoc
    Nguyen, Thien
    Vu, Dinh-Hong
    Tran, Huu-Anh
    Vo, Bay
    IEEE ACCESS, 2022, 10 : 78928 - 78938
  • [6] Improving Chinese-Vietnamese Neural Machine Translation with Linguistic Differences
    Yu, Zhiqiang
    Yu, Zhengtao
    Xian, Yantuan
    Huang, Yuxin
    Guo, Junjun
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2022, 21 (02)
  • [7] Exploring Machine Translation on the Chinese-Vietnamese Language Pair
    Huu-Anh Tran
    Phuoc Tran
    Phuong-Thuy Dao
    Thi-Mien Pham
    COMPUTATIONAL DATA AND SOCIAL NETWORKS, 2019, 11917 : 205 - 206
  • [8] Language Post Positioned Characteristic Based Chinese-Vietnamese Statistical Machine Translation Method
    He, Jianyalin
    Yu, Zhengtao
    Lv, Changtao
    Lai, Hua
    Gao, Shengxiang
    Zhang, Yang
    2017 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2017, : 180 - 184
  • [9] Handling syntactic difference in Chinese-Vietnamese neural machine translation
    Yu, Zhiqiang
    Wang, Ting
    Liu, Shihu
    Tan, Xuewen
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2024, 46 (03) : 5533 - 5544
  • [10] Word Re-Segmentation in Chinese-Vietnamese Machine Translation
    Phuoc Tran
    Dien Dinh
    Nguyen, Long H. B.
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2016, 16 (02)