Multiple Testing in Statistical Analysis of Systems-Based Information Retrieval Experiments

Cited by: 61

Authors:

Carterette, Benjamin A. [1]

Affiliation:

[1] Univ Delaware, Dept Comp & Informat Syst, Newark, DE 19716 USA

Source:

ACM TRANSACTIONS ON INFORMATION SYSTEMS | 2012 / Vol. 30 / No. 01

Keywords:

Experimentation; Measurement; Theory; Information retrieval; effectiveness evaluation; test collections; experimental design; statistical analysis; INFERENCE;

DOI:

10.1145/2094072.2094076

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

High-quality reusable test collections and formal statistical hypothesis testing together support a rigorous experimental environment for information retrieval research. But as Armstrong et al. [2009b] recently argued, global analysis of experiments suggests that there has actually been little real improvement in ad hoc retrieval effectiveness over time. We investigate this phenomenon in the context of simultaneous testing of many hypotheses using a fixed set of data. We argue that the most common approaches to significance testing ignore a great deal of information about the world. Taking into account even a fairly small amount of this information can lead to very different conclusions about systems than those that have appeared in published literature. We demonstrate how to model a set of IR experiments for analysis both mathematically and practically, and show that doing so can cause p-values from statistical hypothesis tests to increase by orders of magnitude. This has major consequences on the interpretation of experimental results using reusable test collections: it is very difficult to conclude that anything is significant once we have modeled many of the sources of randomness in experimental design and analysis.
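As a rough illustration of why simultaneous testing of many hypotheses inflates p-values, the sketch below applies two generic multiple-comparison adjustments (Bonferroni and Holm) to a handful of unadjusted p-values. This is not the model-based analysis the abstract describes; it is a simpler, standard stand-in. The function names and the input numbers are hypothetical and chosen only for illustration.

```python
def bonferroni_adjust(p_values):
    """Bonferroni: multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


def count_significant(p_values, alpha=0.05):
    """Number of p-values strictly below the significance level."""
    return sum(1 for p in p_values if p < alpha)


# Made-up unadjusted p-values, e.g. from pairwise system comparisons
# run on the same test collection (assumption: illustrative only).
raw = [0.001, 0.012, 0.03, 0.04, 0.20]
bonf = bonferroni_adjust(raw)
holm = holm_adjust(raw)

print("raw significant:       ", count_significant(raw))
print("Holm significant:      ", count_significant(holm))
print("Bonferroni significant:", count_significant(bonf))
```

With these inputs, fewer comparisons remain below the 0.05 threshold after either adjustment than before it, which is the qualitative effect the abstract points to. The size of the effect in real experiments depends on the number of hypotheses and on the model used.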


Pages: 34
