Wikipedia as Multilingual Source of Comparable Corpora

Authors
Gamallo Otero, Pablo [1]
Gonzalez Lopez, Isaac [1]
Affiliations
[1] Univ Santiago de Compostela, Galiza, Spain
DOI
Not available
Chinese Library Classification number
H [Language and writing]
Discipline classification code
05
Abstract
This article describes an automatic method for building comparable corpora from Wikipedia, using Categories as topic restrictions. Our strategy relies on the fact that Wikipedia is a multilingual encyclopedia containing semi-structured information. Given two languages and a particular topic, the method builds a corpus of texts in the two selected languages whose content focuses on the selected topic. Tools and corpora will be distributed under free licenses (General Public License and Creative Commons).
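The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Article` structure, the `build_comparable_corpus` function, and the toy data are all hypothetical, and the sketch assumes the category name expressing the topic in each language is supplied by hand rather than discovered automatically.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Article:
    """Toy stand-in for one Wikipedia article (hypothetical structure)."""
    title: str
    lang: str
    text: str
    categories: Set[str] = field(default_factory=set)


def build_comparable_corpus(
    articles: List[Article], topic_by_lang: Dict[str, str]
) -> Dict[str, List[str]]:
    """Group article texts by language, keeping only on-topic articles.

    topic_by_lang maps a language code to the category name that
    expresses the chosen topic in that language (assumed to be given).
    """
    corpus: Dict[str, List[str]] = {lang: [] for lang in topic_by_lang}
    for article in articles:
        topic = topic_by_lang.get(article.lang)
        # Keep the article only if its language was requested and it
        # carries the topic category for that language.
        if topic is not None and topic in article.categories:
            corpus[article.lang].append(article.text)
    return corpus


# Toy data, invented for illustration only.
articles = [
    Article("Linux", "en", "Linux is a family of operating systems.", {"Free software"}),
    Article("Paris", "en", "Paris is the capital of France.", {"Cities"}),
    Article("Linux", "es", "Linux es un sistema operativo.", {"Software libre"}),
    Article("Madrid", "es", "Madrid es la capital de España.", {"Ciudades"}),
]

corpus = build_comparable_corpus(
    articles, {"en": "Free software", "es": "Software libre"}
)
```

The two resulting text lists share a topic but are not translations of each other, which is what makes the corpus comparable rather than parallel.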
Pages: 21-25
Page count: 5