Learning How to Translate North Korean through South Korean

Authors
Kim, Hwichan [1]
Moon, Sangwhan [2, 3]
Okazaki, Naoaki [2]
Komachi, Mamoru [1]
Affiliations
[1] Tokyo Metropolitan Univ, 6-6 Asahigaoka, Hino, Tokyo 1910065, Japan
[2] Tokyo Inst Technol, 2-12-1 Ookayama, Tokyo 1528550, Japan
[3] Google LLC, 1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA
Keywords
Parallel corpus construction; Machine translation; Korean
DOI
Not available
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
South and North Korea both use the Korean language, but there are some differences in their linguistic aspects, such as vocabulary and spelling rules. Korean NLP research has focused on South Korean only, and existing NLP systems for the Korean language, such as neural machine translation (NMT) models, cannot properly handle North Korean input. Training a model using North Korean data is the most straightforward approach to solving this problem, but there is insufficient data to train NMT models. In this study, we create data for North Korean NMT models using a comparable corpus. First, we manually create evaluation data for automatic alignment and machine translation. Then, we investigate automatic alignment methods suitable for North Korean data. Finally, we verify that a model trained using North Korean bilingual data without human annotation can significantly increase North Korean translation accuracy compared to existing South Korean models in zero-shot settings.
Pages: 6711-6718
Number of pages: 8