Corpus-Based Diacritic Restoration for South Slavic Languages

Cited by: 0
Authors
Ljubesic, Nikola [1, 3]
Erjavec, Tomaz [1]
Fiser, Darja [1, 2]
Affiliations
[1] Jozef Stefan Inst, Dept Knowledge Technol, Jamova Cesta 39, SI-1000 Ljubljana, Slovenia
[2] Univ Ljubljana, Fac Arts, Askerceva Cesta 2, SI-1000 Ljubljana, Slovenia
[3] Univ Zagreb, Dept Informat & Commun Sci, Ivana Lucica 3, HR-10000 Zagreb, Croatia
Funding
Swiss National Science Foundation
Keywords
computer-mediated communication; diacritic restoration; South-Slavic languages
DOI
Not available
Chinese Library Classification number
H [Language, Writing]
Discipline classification code
05
Abstract
In computer-mediated communication, users of Latin-based scripts often omit diacritics when writing. Such text is usually easy for humans to understand but difficult to process computationally, because many words become ambiguous or unknown. Letter-level approaches to diacritic restoration generalise better and do not require much training data, but word-level approaches tend to yield better results. However, word-level approaches typically rely on a lexicon, an expensive resource that does not cover non-standard forms and is often unavailable for less-resourced languages. In this paper we present diacritic restoration models that are trained on easy-to-acquire corpora. We test three types of corpora (Wikipedia, general web, Twitter) for three South Slavic languages (Croatian, Serbian and Slovene) and evaluate the models on two types of text: standard (Wikipedia) and non-standard (Twitter). The proposed approach considerably outperforms charlifter, so far the only open-source tool available for this task. We make the best-performing systems freely available.
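The word-level, corpus-trained idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' system: it assumes a plain frequency baseline in which each diacritic-free word form is mapped to the diacritised form seen most often in a training corpus. The function names (`strip_diacritics`, `train`, `restore`) and the toy corpus are hypothetical, chosen only for illustration.

```python
import re
import unicodedata
from collections import Counter, defaultdict

# "đ"/"Đ" have no Unicode decomposition, so they are mapped explicitly.
_EXTRA = str.maketrans({"đ": "d", "Đ": "D"})
_TOKEN = re.compile(r"\w+|\W+")


def strip_diacritics(text):
    """Remove combining marks (č -> c, š -> s, ž -> z, ć -> c) and map đ -> d."""
    decomposed = unicodedata.normalize("NFD", text)
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return bare.translate(_EXTRA)


def train(corpus_lines):
    """Map each stripped, lower-cased form to its most frequent corpus form."""
    counts = defaultdict(Counter)
    for line in corpus_lines:
        for tok in re.findall(r"\w+", line.lower()):
            counts[strip_diacritics(tok)][tok] += 1
    return {bare: forms.most_common(1)[0][0] for bare, forms in counts.items()}


def restore(text, model):
    """Replace every known word with its most frequent diacritised form."""
    out = []
    for tok in _TOKEN.findall(text):
        form = model.get(strip_diacritics(tok.lower()))
        if form is None:
            out.append(tok)            # unknown word or non-word span: keep as is
        elif tok.isupper():
            out.append(form.upper())   # preserve all-caps tokens
        elif tok.istitle():
            out.append(form.capitalize())
        else:
            out.append(form)
    return "".join(out)


# Toy corpus (hypothetical): "čaša" is seen twice, "casa" once.
corpus = ["čaša je u kući", "čaša vode", "casa"]
model = train(corpus)
restored = restore("Casa je u kuci, zar ne?", model)
```

Because the lookup table is built directly from running text, it also picks up non-standard forms that occur in the corpus, which is the coverage gap of hand-built lexicons that the abstract points to. A context-free baseline like this cannot resolve words that are genuinely ambiguous once diacritics are removed.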
Pages: 3612-3616
Number of pages: 5