Effects of diacritics on Turkish information retrieval
Cited by: 7
Authors:
Alpkocak, Adil [1]
Ceylan, Meltem [1]
Affiliation:
[1] Dokuz Eylul Univ, Dept Comp Engn, TR-35160 Izmir, Turkey
Keywords:
Turkish information retrieval;
diacritics;
document expansion;
query expansion;
DOI:
10.3906/elk-1010-819
Chinese Library Classification number:
TP18 [Artificial intelligence theory];
Discipline classification codes:
081104 ;
0812 ;
0835 ;
1405 ;
Abstract:
We investigate the effects of improper use of diacritics in the Turkish alphabet on information retrieval. A diacritic is simply a supplementary sign added to a letter to change the sound value of the letter, and the Turkish alphabet has 5 special letters derived from Latin by adding different diacritics. The statistical analysis performed in this study shows that retrieval performance significantly decreases when documents and queries contain letters with different forms, such that documents consist of letters with diacritics while queries consist of standard Latin letters and vice versa. In order to tackle this challenge, we propose 3 approaches: token normalization by equivalence classes, document expansion, and query expansion. The experimental evaluations carried out on the Bilkent Turkish information retrieval test collection suggest that the proposed approaches are promising as a remedy in this line of research.
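As a rough illustration of two of the approaches named in the abstract, the sketch below shows what token normalization by equivalence classes and query expansion could look like in Python. It is not the authors' implementation: the letter mapping (including the choice to fold ı and İ), the function names `normalize_token` and `expand_query_term`, and the decision to enumerate every variant are all assumptions made for this example.

```python
from itertools import product

# Assumed equivalence classes: each Turkish-specific letter folds to a
# base Latin letter. This table is illustrative, not taken from the paper.
TURKISH_FOLD = str.maketrans({
    "ç": "c", "Ç": "C",
    "ğ": "g", "Ğ": "G",
    "ı": "i", "İ": "I",
    "ö": "o", "Ö": "O",
    "ş": "s", "Ş": "S",
    "ü": "u", "Ü": "U",
})

# Assumed variant table for expansion: base letter -> all forms it may stand for.
VARIANTS = {"c": "cç", "g": "gğ", "i": "iı", "o": "oö", "s": "sş", "u": "uü"}


def normalize_token(token: str) -> str:
    """Map a token to its equivalence-class representative.

    Folding happens before lowercasing so that 'İ' becomes 'I' and then 'i',
    instead of going through Python's default lowercasing of 'İ'.
    """
    return token.translate(TURKISH_FOLD).lower()


def expand_query_term(term: str) -> set[str]:
    """Return every spelling of `term` with or without diacritics.

    The number of variants grows exponentially with the count of ambiguous
    letters, so a real system would likely prune this set.
    """
    base = normalize_token(term)
    choices = [VARIANTS.get(ch, ch) for ch in base]
    return {"".join(combo) for combo in product(*choices)}
```

With normalization, the same function would be applied to both documents and queries at indexing and search time, so that `çağ` and `cag` land on the same index term. With expansion, only the query side changes: a query for `cag` would be rewritten to search for all of its variants against an unmodified index.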
Pages: 787-804
Page count: 18