On collection size and retrieval effectiveness

Cited by: 28
Authors
Hawking, D [1]
Robertson, S [2]
Affiliations
[1] CSIRO Math & Informat Sci, Canberra, ACT, Australia
[2] Microsoft Res, Cambridge, England
Source
Information Retrieval | 2003 / Vol. 6 / Issue 1
Keywords
text retrieval models; signal detection theory; collection sampling; relevance score distributions
DOI
10.1023/A:1022904715765
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The relationship between collection size and retrieval effectiveness is particularly important in the context of Web search. We investigate it first analytically and then experimentally, using samples and subsets of test collections. Different retrieval systems vary in how the score assigned to an individual document in a sample collection relates to the score it receives in the full collection; we identify four cases. We apply signal detection (SD) theory to retrieval from samples, taking into account the four cases and using a variety of shapes for relevant and irrelevant distributions. We note that the SD model subsumes several earlier hypotheses about the causes of the decreased precision in samples. We also discuss other models which contribute to an understanding of the phenomenon, particularly relating to the effects of discreteness. Different models provide complementary insights. Extensive use is made of test data, some from official submissions to the TREC-6 VLC track and some new, to illustrate the effects and test hypotheses. We empirically confirm predictions, based on SD theory, that P@n should decline when moving to a sample collection and that average precision and R-precision should remain constant. SD theory suggests the use of recall-fallout plots as operating characteristic (OC) curves. We plot OC curves of this type for a real retrieval system and query set and show that curves for sample collections are similar but not identical to the curve for the full collection.
Pages: 99-150
Page count: 52
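
Illustration (not from the paper): the abstract's prediction that P@n declines when moving from a full collection to a sample can be reproduced with a small simulation in the spirit of the signal detection model. This is only a sketch under stated assumptions: the normal score distributions, their parameters, the collection size, and the function names `precision_at_n` and `simulate` are all hypothetical choices made here, and each document is assumed to receive the same score in the sample as in the full collection, which is just one of the ways a system might behave.

```python
import random


def precision_at_n(scored_docs, n):
    """Fraction of the n highest-scoring documents that are relevant."""
    top = sorted(scored_docs, key=lambda d: d[0], reverse=True)[:n]
    return sum(1 for _, is_rel in top if is_rel) / n


def simulate(n_docs=20_000, rel_fraction=0.005, sample_rate=0.1,
             n=10, trials=20, seed=0):
    """Return mean P@n over `trials` for the full collection and a uniform sample."""
    rng = random.Random(seed)
    n_rel = round(n_docs * rel_fraction)
    sample_size = round(n_docs * sample_rate)
    full_sum = 0.0
    sample_sum = 0.0
    for _ in range(trials):
        # Assumed score distributions: relevant ~ N(2, 1), irrelevant ~ N(0, 1).
        docs = [(rng.gauss(2.0, 1.0), True) for _ in range(n_rel)]
        docs += [(rng.gauss(0.0, 1.0), False) for _ in range(n_docs - n_rel)]
        # Uniform random sample; each document keeps its full-collection score.
        sample = rng.sample(docs, sample_size)
        full_sum += precision_at_n(docs, n)
        sample_sum += precision_at_n(sample, n)
    return full_sum / trials, sample_sum / trials


if __name__ == "__main__":
    p_full, p_sample = simulate()
    print(f"mean P@10, full collection: {p_full:.3f}")
    print(f"mean P@10, 10% sample:      {p_sample:.3f}")
```

Under these assumptions the sample holds proportionally fewer documents in the high-scoring tail of both distributions, so a fixed cutoff of n documents reaches further down the ranking, where irrelevant documents are relatively more common, and mean P@n comes out lower for the sample.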