Understanding inverse document frequency: on theoretical arguments for IDF

Cited by: 793
Author
Robertson, S [1]
Affiliations
[1] Microsoft Res, Cambridge, England
[2] City Univ London, London EC1V 0HB, England
Keywords
information theory; probabilistic analysis; modelling; text retrieval
DOI
10.1108/00220410410560582
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The term-weighting function known as IDF was proposed in 1972, and has since been extremely widely used, usually as part of a TF*IDF function. It is often described as a heuristic, and many papers have been written (some based on Shannon's Information Theory) seeking to establish some theoretical basis for it. Some of these attempts are reviewed, and it is shown that the Information Theory approaches are problematic, but that there are good theoretical justifications of both IDF and TF*IDF in the traditional probabilistic model of information retrieval.
Pages: 503-520
Page count: 18