Learning How to Translate North Korean through South Korean
Authors:
Kim, Hwichan [1]
Moon, Sangwhan [2,3]
Okazaki, Naoaki [2]
Komachi, Mamoru [1]
Affiliations:
[1] Tokyo Metropolitan Univ, 6-6 Asahigaoka, Hino, Tokyo 1910065, Japan
[2] Tokyo Inst Technol, 2-12-1 Ookayama, Tokyo 1528550, Japan
[3] Google LLC, 1600 Amphitheatre Pkwy, Mountain View, CA 1600 USA
Keywords:
Parallel corpus construction;
Machine translation;
Korean;
DOI:
Not available
Chinese Library Classification (CLC) number:
TP39 [Computer applications];
Discipline classification codes:
081203 ;
0835 ;
Abstract:
South and North Korea both use the Korean language, but the two varieties differ in linguistic aspects such as vocabulary and spelling rules. Korean NLP research has focused only on South Korean, and existing Korean NLP systems, such as neural machine translation (NMT) models, cannot properly handle North Korean input. Training a model on North Korean data is the most straightforward solution to this problem, but too little data is available to train NMT models. In this study, we create data for North Korean NMT models from a comparable corpus. First, we manually create evaluation data for automatic alignment and machine translation. Then, we investigate automatic alignment methods suitable for North Korean data. Finally, we verify that a model trained on North Korean bilingual data without human annotation significantly improves North Korean translation accuracy over existing South Korean models in zero-shot settings.