A Holistic Approach to Bilingual Sentence Fragment Extraction from Comparable Corpora

Cited by: 0
Authors
Khademian, Mahdi [1 ]
Taghipour, Kaveh [1 ]
Mansour, Saab [2 ]
Khadivi, Shahram [1 ]
Affiliations
[1] Amirkabir Univ Technol, Dept Comp Engn & IT, Human Language Technol Lab, 424 Hafez Ave, Tehran, Iran
[2] Rhein Westfal TH Aachen, Human Language Technol & Pattern Recognit Grp, Dept Comp Sci, Aachen, Germany
Source
LREC 2012 - EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2012
Keywords
Parallel Fragment Extraction; Hough Transform; Statistical Machine Translation
DOI
Not available
Chinese Library Classification number
H0 [Linguistics]
Discipline classification codes
030303; 0501; 050102
Abstract
Achieving accurate translation with statistical machine translation systems, especially for multi-domain documents, requires ever larger amounts of bilingual text, and this need becomes more critical when training such systems for language pairs with scarce training data. In recent years, some research has explored a new source of parallel text: documents that are not necessarily parallel but are comparable. Because existing methods search for possible translation equivalences greedily, they cannot consider all possible parallel texts in comparable documents. This paper investigates a different approach that considers the relationships between all words of two comparable documents and works fairly well even in the worst case of comparability. We represent each document pair as a matrix and then transform it to a new space to find parallel fragments. Evaluations show that the system successfully extracts useful fragment pairs.
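The abstract gives no implementation detail, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes the matrix marks word pairs that a bilingual lexicon links, and that the "new space" is the (theta, rho) space of a standard Hough transform, as the keywords suggest. The function names, the toy lexicon, and the example sentences are all hypothetical.

```python
import math
from collections import defaultdict

def match_points(src, tgt, lexicon):
    """Cells (i, j) of the document-pair matrix where src[i] may translate to tgt[j]."""
    return [(i, j)
            for i, s in enumerate(src)
            for j, t in enumerate(tgt)
            if t in lexicon.get(s, ())]

def best_line(points, n_theta=180, rho_step=1.0):
    """Standard Hough voting (an assumption, not the paper's exact procedure).

    Each point votes for every line rho = x*cos(theta) + y*sin(theta) through it.
    Returns the points that fall in the most-voted (theta, rho) bin.
    """
    if not points:
        return []
    bins = defaultdict(list)
    for x, y in points:
        for k in range(n_theta):
            theta = math.pi * k / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            bins[(k, round(rho / rho_step))].append((x, y))
    return sorted(max(bins.values(), key=len))

# Hypothetical toy data: one aligned run plus one stray match.
src = ["today", "we", "the", "small", "house", "burned"]
tgt = ["das", "kleine", "haus", "brannte", "gestern", "heute"]
lexicon = {"the": {"das"}, "small": {"kleine"}, "house": {"haus"},
           "burned": {"brannte"}, "today": {"heute"}}

points = match_points(src, tgt, lexicon)
fragment = best_line(points)
print(fragment)  # [(2, 0), (3, 1), (4, 2), (5, 3)]
```

In this toy case the four matches along one diagonal collect the most votes, while the isolated match at (0, 5) does not share a bin with them, so the candidate fragment is "the small house burned" paired with "das kleine haus brannte". A real system would need a probabilistic lexicon and a way to handle several lines per document pair.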
Pages: 4073-4079
Page count: 7
Related papers
50 in total
  • [41] The treatment of polysemy in the extraction of bilingual lexics from parallel corpora
    Gamallo Otero, Pablo
    Sotelo Docio, Susana
    PROCESAMIENTO DEL LENGUAJE NATURAL, 2005, (35): : 103 - 110
  • [42] Automatic Methods for the Extension of a Bilingual Dictionary using Comparable Corpora
    Rosner, Michael
    Sultana, Kurt
    LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2014, : 3790 - 3797
  • [43] Effect of cross-language IR in bilingual lexicon acquisition from comparable corpora
    Utsuro, T
    Horiuchi, T
    Hino, K
    Hamamoto, T
    Nakayama, T
    EACL 2003: 10TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, PROCEEDINGS OF THE CONFERENCE, 2003, : 355 - 362
  • [44] A statistical view on bilingual lexicon extraction: From parallel corpora to non-parallel corpora
    Fung, P
    MACHINE TRANSLATION AND THE INFORMATION SOUP, 1998, 1529 : 1 - 17
  • [45] Automatic Parallel Corpora and Bilingual Terminology extraction from Parallel WebSites
    Almeida, Jose Joao
    Simoes, Alberto
    LREC 2010 - SEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2010, : 50 - 55
  • [46] Co-clustering of bilingual datasets as a mean for assisting the construction of thematic bilingual comparable corpora
    Ke, Guiyao
    Marteau, Pierre-Francois
    LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2014, : 1992 - 1999
  • [47] MERONYMY IN A TOURIST GUIDE: TOWARDS A COHERENT ANALYSIS OF BILINGUAL COMPARABLE CORPORA
    Sliwa, Dorota
    ROCZNIKI HUMANISTYCZNE, 2012, 60 (08): : 97 - 128
  • [48] Unsupervised word-sense disambiguation using bilingual comparable corpora
    Kaji, H
    Morimoto, Y
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2005, E88D (02) : 289 - 301
  • [49] Document and Sentence Alignment in Comparable Corpora Using Bipartite Graph Matching
    Rahimi, Zeinab
    Taghipour, Kaveh
    Khadivi, Shahram
    Afhami, Nasim
    2012 SIXTH INTERNATIONAL SYMPOSIUM ON TELECOMMUNICATIONS (IST), 2012, : 817 - 821
  • [50] Automatic Generation of Bilingual Dictionaries Using Intermediary Languages and Comparable Corpora
    Gamallo Otero, Pablo
    Pichel Campos, Jose Ramom
    COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, 2010, 6008 : 473 - +