Bertalign: Improved word embedding-based sentence alignment for Chinese-English parallel corpora of literary texts

Cited by: 3
Authors
Liu, Lei [1]
Zhu, Min [1]
Affiliations
[1] Yanshan Univ, Sch Foreign Languages, Qinhuangdao, Peoples R China
Keywords
CORPUS;
DOI
10.1093/llc/fqac089
Chinese Library Classification number
C [Social Sciences, General];
Discipline classification code
03 ; 0303 ;
Abstract
Bertalign is designed to improve sentence alignment accuracy for Chinese-English parallel corpora of literary texts. Aligning bilingual literary texts is not trivial, since most of the translation is interpretative and not based on 1-to-1 mappings between source and target sentences. Existing alignment methods favor 1-to-1 links and have difficulty coping with the 1-to-many and many-to-many alignments that are common in literary texts. To overcome the weaknesses of current approaches, we propose a novel two-step algorithm for bilingual sentence alignment. The first step finds the optimal paths for 1-to-1 alignments based on the top-k most semantically similar target sentences for each source sentence, using cross-lingual word embeddings based on Bidirectional Encoder Representations from Transformers (BERT). The second step relies on the search paths found in the first step to recover all valid alignments with more than one sentence on each side of the bilingual text. A comprehensive experiment was conducted on a newly built Chinese-English literary parallel corpus and a large-scale publicly available bilingual corpus of the Bible to compare the performance of Bertalign with five baseline systems: Gale-Church, Hunalign, Bleualign, Bleurtalign, and Vecalign. The results show that Bertalign achieves a higher F1 score than the previous methods on both evaluation datasets.
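The first step described in the abstract (top-k candidate retrieval followed by a search for a 1-to-1 path) could be sketched roughly as follows. This is an illustrative sketch only, not Bertalign's implementation: plain lists of floats stand in for the BERT-based embeddings, the function names `top_k_candidates` and `one_to_one_path` are hypothetical, and the simple dynamic program over candidate pairs is an assumed stand-in for the paper's path search. The second step (recovering 1-to-many and many-to-many links) is not covered.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_candidates(src_vecs, tgt_vecs, k):
    """For each source vector, return up to k (target_index, similarity)
    pairs, most similar first."""
    result = []
    for s in src_vecs:
        scored = [(j, cosine(s, t)) for j, t in enumerate(tgt_vecs)]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result.append(scored[:k])
    return result


def one_to_one_path(candidates):
    """Choose a chain of (source, target) pairs in which both indices
    strictly increase, maximising the summed similarity (assumed objective)."""
    # Flatten to (src, tgt, sim); sorting by src guarantees that every
    # valid predecessor of a pair appears earlier in the list.
    pairs = sorted(
        (i, j, sim) for i, cands in enumerate(candidates) for j, sim in cands
    )
    if not pairs:
        return []
    best = [sim for _, _, sim in pairs]  # best chain score ending at each pair
    prev = [None] * len(pairs)           # back-pointers for path recovery
    for b, (i2, j2, sim2) in enumerate(pairs):
        for a in range(b):
            i1, j1, _ = pairs[a]
            if i1 < i2 and j1 < j2 and best[a] + sim2 > best[b]:
                best[b] = best[a] + sim2
                prev[b] = a
    end = max(range(len(pairs)), key=best.__getitem__)
    path = []
    while end is not None:
        path.append(pairs[end][:2])
        end = prev[end]
    return path[::-1]


# Toy data: three source "sentences" and four targets, where target 1 is
# an extra sentence with no source counterpart.
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tgt = [[0.9, 0.1, 0.0], [0.3, 0.3, 0.3], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
print(one_to_one_path(top_k_candidates(src, tgt, k=2)))
```

Restricting the search to the top-k candidates keeps the number of pairs considered at k per source sentence instead of one per source-target combination, which is the motivation the abstract gives for retrieving candidates first.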
Pages: 621-634
Page count: 14
Related papers
7 in total
  • [1] A Research on Length Based Sentence Alignment for Chinese-English Parallel Corpus
    Zan, Hongying
    Zhang, Xia
    Fan, Ming
    FIFTH INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY, VOL 4, PROCEEDINGS, 2008: 145 - 149
  • [2] Research of English-Chinese alignment at word granularity on parallel corpora
    Xu Yang
    Wang Hou-feng
    Lue Xue-qiang
    7TH IEEE/ACIS INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION SCIENCE IN CONJUNCTION WITH 2ND IEEE/ACIS INTERNATIONAL WORKSHOP ON E-ACTIVITY, PROCEEDINGS, 2008: 223 - +
  • [3] Two-phase base noun phrase alignment in Chinese-English parallel corpora
    Zhao, J
    Liu, FF
    Liu, DM
    Proceedings of the 2005 IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE'05), 2005: 360 - 365
  • [4] Extracting Historical Terms Based on Aligned Chinese-English Parallel Corpora
    Li, Xiuying
    Che, Chao
    Han, Limin
    Liu, Xiaoxia
    IEEE NLP-KE 2009: PROCEEDINGS OF INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND KNOWLEDGE ENGINEERING, 2009: 296 - 301
  • [5] Research of Chinese-English word alignment algorithm based on bilingual dictionary
    Deng, Dan
    Liu, Qun
    Yu, Hongkui
    Jisuanji Gongcheng/Computer Engineering, 2005, 31 (16): 45 - 47
  • [7] A new double attention decoding model based on cascade RCNN and word embedding fusion for Chinese-English multimodal translation
    Liu H.
    International Journal of Reasoning-based Intelligent Systems, 2024, 16 (01): 26 - 36