Towards Producing Bilingual Lexica from Monolingual Corpora

Cited by: 0
Authors
Han, Jingyi [1]
Bel, Nuria [1]
Affiliations
[1] Univ Pompeu Fabra, Roc Boronat 138, Barcelona 08018, Spain
Keywords
automatic bilingual lexicon production; lexical resources; bilingual dictionaries
DOI
Not available
Chinese Library Classification (CLC) number
H [Language, Writing Systems]
Discipline classification code
05
Abstract
Bilingual lexica are the basis for many cross-lingual natural language processing tasks. Recent work has shown success in learning bilingual dictionaries by taking advantage of comparable corpora and a diverse set of signals derived from monolingual corpora. In the present work, we describe an approach to automatically learning bilingual lexica by training a supervised classifier on word-embedding-based vectors of only a few hundred translation-equivalent word pairs. The word embedding representations of the translation pairs were obtained from source and target monolingual corpora, which are not necessarily related. Our classifier is able to predict whether or not a new word pair is in a translation relation. We tested it on two quite distinct language pairs, Chinese-Spanish and English-Spanish. The classifiers achieved more than 0.90 precision and recall for both language pairs in different evaluation scenarios. These results show a high potential for this method to be used in bilingual lexicon production for language pairs with limited amounts of parallel or comparable corpora, in particular for phrase table expansion in Statistical Machine Translation systems.
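The idea in the abstract, a binary classifier trained on vectors built from the embeddings of known translation pairs, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the abstract does not say which classifier or feature construction the authors used, so this sketch swaps in a plain logistic regression over the flattened outer product of the two embeddings, and it replaces real monolingual embeddings with synthetic 4-dimensional vectors (the names `pair_features`, `train`, `src`, and `tgt` are all hypothetical).

```python
import math
import random

def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def pair_features(src_vec, tgt_vec):
    """Flattened outer product of the two embeddings, plus a bias term.
    (Assumed feature construction; the abstract does not specify one.)"""
    return [s * t for s in src_vec for t in tgt_vec] + [1.0]

def predict(weights, feats):
    """Probability that the word pair is a translation pair."""
    z = sum(w * f for w, f in zip(weights, feats))
    z = max(-30.0, min(30.0, z))  # clamp so exp() stays well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(weights, rows, labels):
    """Mean binary cross-entropy over a labelled set of pairs."""
    total = 0.0
    for feats, y in zip(rows, labels):
        p = predict(weights, feats)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(rows)

def train(rows, labels, lr=1.0, epochs=300):
    """Full-batch gradient descent on the logistic loss."""
    weights = [0.0] * len(rows[0])
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for feats, y in zip(rows, labels):
            err = predict(weights, feats) - y
            for i, f in enumerate(feats):
                grad[i] += err * f
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights

# Synthetic stand-in for two independently trained embedding spaces:
# each "target" vector is the source vector with its coordinates reversed,
# plus a little noise. Real embeddings would come from monolingual corpora.
rng = random.Random(0)
DIM, N_PAIRS = 4, 100
src = [unit([rng.gauss(0, 1) for _ in range(DIM)]) for _ in range(N_PAIRS)]
tgt = [unit([v + rng.gauss(0, 0.05) for v in reversed(s)]) for s in src]

rows, labels = [], []
for i in range(N_PAIRS):
    rows.append(pair_features(src[i], tgt[i]))                  # translation pair
    labels.append(1)
    rows.append(pair_features(src[i], tgt[(i + 1) % N_PAIRS]))  # mismatched pair
    labels.append(0)

weights = train(rows, labels)
initial_loss = log_loss([0.0] * len(weights), rows, labels)
final_loss = log_loss(weights, rows, labels)
print(f"log-loss: {initial_loss:.3f} -> {final_loss:.3f}")
```

Calling `predict(weights, pair_features(s, t))` on an unseen pair returns a probability that the pair is in a translation relation. The outer-product features are used here because a linear model over them can score how well a target vector matches a linear map of the source vector; whether this resembles the authors' setup is not stated in the abstract.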
Pages: 2222-2227
Page count: 6