Producing Monolingual and Parallel Web Corpora at the Same Time - SpiderLing and Bitextor's Love Affair

Cited by: 0

Authors:

Ljubesic, Nikola [1]

Espla-Gomis, Miquel [2]

Toral, Antonio [3]

Ortiz-Rojas, Sergio [4]

Klubicka, Filip [1]

Affiliations:

[1] Univ Zagreb, Dept Informat & Commun Sci, Zagreb, Croatia

[2] Univ Alacant, Dept Llenguatges & Sistemes Informat, Alacant, Spain

[3] Dublin City Univ, Sch Comp, ADAPT Ctr, Dublin, Ireland

[4] Prompsit Language Engineering, Elx, Spain

Source:

LREC 2016 - TENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2016

Keywords:

crawling; top-level domain; monolingual corpus; parallel corpus

DOI:

Not available

Chinese Library Classification (CLC) number:

H [Language and Writing]

Discipline classification code:

05

Abstract:

This paper presents an approach for building large monolingual corpora and, at the same time, extracting parallel data by crawling the top-level domain of a given language of interest. For gathering linguistically relevant data from top-level domains we use the SpiderLing crawler, modified to crawl data written in multiple languages. The output of this process is then fed to Bitextor, a tool for harvesting parallel data from a collection of documents. We call the system combining these two tools Spidextor, a blend of the names of its two crucial parts. We evaluate the described approach intrinsically by measuring the accuracy of the extracted bitexts from the Croatian top-level domain .hr and the Slovene top-level domain .si, and extrinsically on the English-Croatian language pair by comparing an SMT system built from the crawled data with third-party systems. We finally present parallel datasets collected with our approach for the English-Croatian, English-Finnish, English-Serbian and English-Slovene language pairs.


Pages: 2949-2956

Page count: 8
