Wikipedia as Multilingual Source of Comparable Corpora

Authors
Gamallo Otero, Pablo [1]
Gonzalez Lopez, Isaac [1]
Affiliation
[1] Univ Santiago de Compostela, Galiza, Spain
Keywords
DOI
Not available
Chinese Library Classification number
H [Language, Linguistics];
Subject classification code
05;
Abstract
This article describes an automatic method to build comparable corpora from Wikipedia using Categories as topic restrictions. Our strategy relies on the fact that Wikipedia is a multilingual encyclopedia containing semi-structured information. Given two languages and a particular topic, our strategy builds a corpus with texts in the two selected languages, whose content is focused on the selected topic. Tools and corpora will be distributed under free licenses (General Public License and Creative Commons).
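The idea in the abstract (restrict each language edition to articles in a topic Category, then collect their texts) can be illustrated with a minimal sketch. This is not the authors' tool: the `Article` type, the `select_by_topic` and `build_comparable_corpus` helpers, and the in-memory article lists are all hypothetical, and a real pipeline would read articles and their Categories from Wikipedia dumps rather than from hand-built objects.

```python
from dataclasses import dataclass


@dataclass
class Article:
    """Hypothetical stand-in for one Wikipedia article in one language."""
    title: str
    text: str
    categories: frozenset = frozenset()


def select_by_topic(articles, topic_categories):
    """Keep articles tagged with at least one of the topic Categories."""
    wanted = {c.lower() for c in topic_categories}
    return [a for a in articles if wanted & {c.lower() for c in a.categories}]


def build_comparable_corpus(articles_by_lang, topic_by_lang):
    """Return, per language, the articles whose Categories match the topic.

    Assumption: the topic is given as a list of Category names per language,
    because Category names differ between language editions.
    """
    return {
        lang: select_by_topic(articles_by_lang[lang], topic_by_lang[lang])
        for lang in topic_by_lang
    }


# Toy data for two languages; titles and texts are invented for illustration.
english = [
    Article("Megalith", "A megalith is a large stone...", frozenset({"Archaeology"})),
    Article("Offside", "Offside is a rule in football...", frozenset({"Sports"})),
]
spanish = [
    Article("Megalito", "Un megalito es una piedra grande...", frozenset({"Arqueología"})),
    Article("Fuera de juego", "El fuera de juego es una regla...", frozenset({"Deportes"})),
]

corpus = build_comparable_corpus(
    {"en": english, "es": spanish},
    {"en": ["Archaeology"], "es": ["Arqueología"]},
)
for lang, docs in corpus.items():
    print(lang, [d.title for d in docs])
```

The two resulting article sets are comparable rather than parallel: they share a topic but are not translations of each other.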
Pages: 21-25
Page count: 5