A Holistic Approach to Bilingual Sentence Fragment Extraction from Comparable Corpora

Cited by: 0

Authors:

Khademian, Mahdi ^{[1]}

Taghipour, Kaveh ^{[1]}

Mansour, Saab ^{[2]}

Khadivi, Shahram ^{[1]}

Affiliations:

[1] Amirkabir Univ Technol, Dept Comp Engn & IT, Human Language Technol Lab, 424 Hafez Ave, Tehran, Iran

[2] Rhein Westfal TH Aachen, Human Language Technol & Pattern Recognit Grp, Dept Comp Sci, Aachen, Germany

Source:

LREC 2012 - EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2012

Keywords:

Parallel Fragment Extraction; Hough Transform; Statistical Machine Translation

DOI:

Not available

Chinese Library Classification:

H0 [Linguistics]

Subject classification codes:

030303; 0501; 050102

Abstract:

Accurate translation with statistical machine translation systems, especially for multi-domain documents, requires ever larger amounts of bilingual text, and the need is most acute when training systems for language pairs with scarce training data. In recent years, some research has explored a new source of parallel text: documents that are not necessarily parallel but are comparable. Because these methods search for possible translation equivalences greedily, they cannot consider all possible parallel texts in comparable documents. This paper investigates a different approach that considers the relationships between all words of two comparable documents and works fairly well even in the worst case of comparability. We represent each document pair as a matrix and then transform it into a new space to find parallel fragments. Evaluations show that the system succeeds in extracting useful fragment pairs.
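
The matrix-and-transform idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes the "new space" is a standard Hough accumulator (the keywords mention the Hough transform), and the names `match_matrix`, `hough_peak`, and the toy `lexicon` are hypothetical. Words that translate each other in roughly the same order form a near-diagonal line in the matrix, which shows up as a peak in the accumulator.

```python
import math
from collections import Counter

def match_matrix(src, tgt, lexicon):
    """Binary matrix: cell (i, j) is 1 when (src[i], tgt[j]) is in the
    (hypothetical) bilingual lexicon, else 0."""
    return [[1 if (s, t) in lexicon else 0 for t in tgt] for s in src]

def hough_peak(matrix, n_theta=180):
    """Vote over lines rho = i*cos(theta) + j*sin(theta).

    Returns (theta_in_degrees, rho, votes) for the strongest line,
    or None when the matrix has no non-zero cell.
    """
    votes = Counter()
    for i, row in enumerate(matrix):
        for j, cell in enumerate(row):
            if not cell:
                continue
            # Each matched cell votes once per sampled angle.
            for k in range(n_theta):
                theta = math.pi * k / n_theta
                rho = round(i * math.cos(theta) + j * math.sin(theta))
                votes[(k * 180 // n_theta, rho)] += 1
    if not votes:
        return None
    (theta_deg, rho), count = votes.most_common(1)[0]
    return theta_deg, rho, count

# Toy data (assumed): three source words align with target words
# shifted by one position, forming a diagonal in the matrix.
src = ["a", "b", "c", "d"]
tgt = ["x", "A", "B", "C"]
lexicon = {("a", "A"), ("b", "B"), ("c", "C")}

matrix = match_matrix(src, tgt, lexicon)
peak = hough_peak(matrix)
```

With this coarse rounding of `rho`, several neighbouring angles can collect the same number of votes, so the sketch only identifies one of the strongest candidate lines; a real system would need finer binning and a way to cut the line into fragment boundaries.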

Pages: 4073 - 4079

Page count: 7

