STATISTICAL METHODS FOR PHRASAL VERB DETECTION IN ESTONIAN DIALECTS

Cited by: 3
Author
Uiboaed, Kristel [1]
Affiliation
[1] Tartu Ulikool, Tartu, Estonia
Source
EESTI RAKENDUSLINGVISTIKA UHINGU AASTARAAMAT | 2010 / Vol. 6
Keywords
computational linguistics; corpus linguistics; dialectology; methods and tools; statistics; Estonian;
DOI
10.5128/ERYa6.19
Chinese Library Classification number
H0 [Linguistics];
Subject classification code
030303 ; 0501 ; 050102 ;
Abstract
The aim of this study was to assess different statistical methods for automatically extracting collocations from a corpus. To extract the collocations, association measures (AMs) were applied and association scores (ASs) were calculated for the collocation candidates found in the corpus. An AS indicates the collocational strength between two words. An advantage of AMs is that they take into account not only the co-occurrence frequency but also the marginal frequencies of the collocating words. To calculate an AS, the following data are needed: the co-occurrence frequency, the marginal frequencies of the collocating words, the expected frequency and the sample size. There are different approaches to applying AMs: words can be considered collocational only if they appear within the same collocational span, within one text unit (clause, sentence, utterance), or if they jointly fulfil a syntactic function. This paper attempts to apply AMs to phrasal verb detection in the Corpus of Estonian Dialects (CED). The texts of the CED were morphologically tagged and parsed. Combinations of adverbs and verbs were extracted and an AS was calculated for every collocation candidate. Experiments were run on three dialect groups using four association measures: t-score, Mutual Information (MI), the chi-squared test and log-likelihood. The results indicate that log-likelihood and t-score outperform MI and the chi-squared test. The outcomes of the different measures vary the most in the Northern dialect group. The best measure for dialect data in general is log-likelihood. However, MI and the chi-squared test work well with low-frequency data. In the Northern dialect group the best AM for detecting low-frequency phrasal verbs is MI, whereas in the North-Eastern and Southern groups the chi-squared test works well for the same purpose. To achieve better results, different scores should be combined.
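The four measures named in the abstract can all be derived from the quantities it lists: co-occurrence frequency, marginal frequencies, expected frequency and sample size. The following is a minimal sketch using common textbook formulations of each measure; the abstract does not state which exact variants the study used, so the formulas, the function name `association_scores` and its parameter names are assumptions made for illustration.

```python
import math


def association_scores(o11, f1, f2, n):
    """Score one adverb+verb candidate (hypothetical helper, not from the paper).

    o11: co-occurrence frequency of the pair
    f1, f2: marginal frequencies of the two words
    n: sample size (total number of candidate pairs)
    """
    if not (0 < f1 < n and 0 < f2 < n and 0 <= o11 <= min(f1, f2)):
        raise ValueError("frequencies are inconsistent with the sample size")

    # Observed 2x2 contingency table built from the marginals.
    observed = [o11, f1 - o11, f2 - o11, n - f1 - f2 + o11]

    # Expected frequencies under the assumption that the words are independent.
    expected = [
        f1 * f2 / n,
        f1 * (n - f2) / n,
        (n - f1) * f2 / n,
        (n - f1) * (n - f2) / n,
    ]
    e11 = expected[0]

    # t-score: difference between observed and expected, scaled by sqrt(observed).
    t_score = (o11 - e11) / math.sqrt(o11) if o11 > 0 else 0.0

    # Mutual Information: log ratio of observed to expected co-occurrence.
    mi = math.log2(o11 / e11) if o11 > 0 else float("-inf")

    # Pearson chi-squared statistic over all four cells.
    chi_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    # Log-likelihood ratio (G^2); empty cells contribute nothing.
    log_likelihood = 2 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )

    return {
        "t_score": t_score,
        "mi": mi,
        "chi_squared": chi_squared,
        "log_likelihood": log_likelihood,
    }


# Invented counts for illustration only; they are not taken from the CED.
scores = association_scores(o11=30, f1=100, f2=200, n=10000)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Ranking the candidates separately by each score and comparing the top of each list is one simple way to see how the measures differ, for example in how they treat low-frequency pairs.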
Pages: 307-326
Page count: 20