Corpus-Based Diacritic Restoration for South Slavic Languages

Cited by: 0

Authors:

Ljubesic, Nikola ^{[1,3]}

Erjavec, Tomaz ^{[1]}

Fiser, Darja ^{[1,2]}

Affiliations:

[1] Jozef Stefan Inst, Dept Knowledge Technol, Jamova Cesta 39, SI-1000 Ljubljana, Slovenia

[2] Univ Ljubljana, Fac Arts, Askerceva Cesta 2, SI-1000 Ljubljana, Slovenia

[3] Univ Zagreb, Dept Informat & Commun Sci, Ivana Lucica 3, HR-10000 Zagreb, Croatia

Source:

LREC 2016 - TENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2016

Funding:

Swiss National Science Foundation;

Keywords:

computer-mediated communication; diacritic restoration; South-Slavic languages;

DOI:

Not available

Chinese Library Classification (CLC) number:

H [Language, Writing];

Subject classification code:

05;

Abstract:

In computer-mediated communication, users of Latin-based scripts often omit diacritics when writing. Such text is typically easy for humans to understand but very difficult to process computationally, because many words become ambiguous or unknown. Letter-level approaches to diacritic restoration generalise better and do not require much training data, but word-level approaches tend to yield better results. However, word-level approaches typically rely on a lexicon, which is an expensive resource that does not cover non-standard forms and is often unavailable for less-resourced languages. In this paper we present diacritic restoration models that are trained on easy-to-acquire corpora. We test three types of corpora (Wikipedia, general web, Twitter) for three South Slavic languages (Croatian, Serbian and Slovene) and evaluate the models on two types of text: standard (Wikipedia) and non-standard (Twitter). The proposed approach considerably outperforms charlifter, so far the only open-source tool available for this task. We make the best-performing systems freely available.
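
To make the word-level, corpus-based idea concrete, the following is a minimal sketch and not the authors' system: it assumes a tokenised corpus that already contains diacritics, strips them to obtain lookup keys, and restores each stripped token to the diacritised form seen most often in the corpus. The names `strip_diacritics`, `train` and `restore`, the character table, and the toy corpus are all hypothetical; the abstract does not describe the actual models in this detail.

```python
from collections import Counter, defaultdict

# Assumed character table: the diacritic letters of Croatian, Serbian (Latin
# script) and Slovene, each mapped to its bare ASCII counterpart.
STRIP_TABLE = str.maketrans("čćšžđČĆŠŽĐ", "ccszdCCSZD")


def strip_diacritics(text):
    """Return text with the diacritics in STRIP_TABLE removed."""
    return text.translate(STRIP_TABLE)


def train(corpus_tokens):
    """Map each stripped form to its most frequent diacritised form."""
    counts = defaultdict(Counter)
    for token in corpus_tokens:
        counts[strip_diacritics(token)][token] += 1
    return {
        stripped: forms.most_common(1)[0][0]
        for stripped, forms in counts.items()
    }


def restore(tokens, model):
    """Replace known tokens with their learned form; keep unknown ones as-is."""
    return [model.get(token, token) for token in tokens]


# Toy corpus (illustrative only, not data from the paper).
corpus = "išao sam u školu a škola je blizu kuće".split()
model = train(corpus)
restored = restore("isao sam u skolu danas".split(), model)
```

A frequency lookup of this kind ignores context, so it cannot choose between two valid forms that share one stripped spelling; it only illustrates why a corpus can stand in for a hand-built lexicon.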

Pages: 3612-3616

Page count: 5
