Wikipedia as Multilingual Source of Comparable Corpora

Cited by: 0
Authors
Gamallo Otero, Pablo [1 ]
Gonzalez Lopez, Isaac [1 ]
Affiliations
[1] Univ Santiago de Compostela, Galiza, Spain
Keywords
DOI
Not available
Chinese Library Classification number
H [Language, Writing];
Subject classification code
05 ;
Abstract
This article describes an automatic method to build comparable corpora from Wikipedia using Categories as topic restrictions. Our strategy relies on the fact that Wikipedia is a multilingual encyclopedia containing semi-structured information. Given two languages and a particular topic, our strategy builds a corpus with texts in the two selected languages, whose content is focused on the selected topic. Tools and corpora will be distributed under free licenses (General Public License and Creative Commons).
Pages: 21-25
Page count: 5
Related papers
50 in total
  • [31] INFORMATION OVERLAP IN MULTILINGUAL WIKIPEDIA AND SUMMARIZATION
    Filatova, Elena
    INTERNATIONAL JOURNAL OF COOPERATIVE INFORMATION SYSTEMS, 2012, 21 (04) : 383 - 403
  • [32] Multilingual Schema Matching for Wikipedia Infoboxes
    Thanh Nguyen
    Moreira, Viviane
    Huong Nguyen
    Hoa Nguyen
    Freire, Juliana
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2011, 5 (02): : 133 - 144
  • [33] Seeing through multilingual corpora: On the use of corpora in contrastive studies
    Viberg, Ake
    LANGUAGE, 2009, 85 (02) : 476 - 480
  • [34] Property type distribution in Wordnet, corpora and Wikipedia
    Barbu, Eduard
    EXPERT SYSTEMS WITH APPLICATIONS, 2015, 42 (07) : 3501 - 3507
  • [35] Building Bilingual Parallel Corpora based on Wikipedia
    Mohammadi, Mehdi
    GhasemAghaee, Nasser
    2010 SECOND INTERNATIONAL CONFERENCE ON COMPUTER ENGINEERING AND APPLICATIONS: ICCEA 2010, PROCEEDINGS, VOL 2, 2010, : 264 - 268
  • [36] MULTEXT: Multilingual text tools and corpora
    Armstrong, S
    LEXICON AND TEXT: REUSABLE METHODS AND RESOURCES FOR THE LINGUISTIC DEVELOPMENT OF GERMAN, 1996, 73 : 107 - 119
  • [37] Building and Modelling Multilingual Subjective Corpora
    Saad, Motaz
    Langlois, David
    Smaili, Kamel
    LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2014, : 3086 - 3091
  • [38] Comparability of corpora and search multilingual terminology
    Morin, Emmanuel
    Daille, Beatrice
    TRAITEMENT AUTOMATIQUE DES LANGUES, 2006, 47 (01): : 113 - 136
  • [39] Pseudo-Aligned Multilingual Corpora
    Diaz, Fernando
    Metzler, Donald
    20TH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2007, : 2727 - 2732
  • [40] Enhanced Entity Annotations for Multilingual Corpora
    Strobl, Michael
    Trabelsi, Amine
    Zaiane, Osmar
    LREC 2022: THIRTEEN INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 3732 - 3740