Multiple Testing in Statistical Analysis of Systems-Based Information Retrieval Experiments

Cited by: 61
Author
Carterette, Benjamin A. [1]
Affiliation
[1] Univ Delaware, Dept Comp & Informat Syst, Newark, DE 19716 USA
Keywords
Experimentation; Measurement; Theory; Information retrieval; effectiveness evaluation; test collections; experimental design; statistical analysis; INFERENCE;
DOI
10.1145/2094072.2094076
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
High-quality reusable test collections and formal statistical hypothesis testing together support a rigorous experimental environment for information retrieval research. But as Armstrong et al. [2009b] recently argued, global analysis of experiments suggests that there has actually been little real improvement in ad hoc retrieval effectiveness over time. We investigate this phenomenon in the context of simultaneous testing of many hypotheses using a fixed set of data. We argue that the most common approaches to significance testing ignore a great deal of information about the world. Taking into account even a fairly small amount of this information can lead to very different conclusions about systems than those that have appeared in published literature. We demonstrate how to model a set of IR experiments for analysis both mathematically and practically, and show that doing so can cause p-values from statistical hypothesis tests to increase by orders of magnitude. This has major consequences on the interpretation of experimental results using reusable test collections: it is very difficult to conclude that anything is significant once we have modeled many of the sources of randomness in experimental design and analysis.
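To illustrate how accounting for many simultaneous tests can change which results count as significant, here is a minimal sketch of one standard multiple-comparisons adjustment, Holm's step-down procedure. This is not claimed to be the method used in the paper, which the abstract does not specify; the function name `holm_adjust` and the p-values are hypothetical and chosen only for illustration.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p-value never falls below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical unadjusted p-values from five system comparisons on one test collection.
raw = [0.001, 0.01, 0.03, 0.04, 0.2]
adj = holm_adjust(raw)

alpha = 0.05
significant_raw = sum(p < alpha for p in raw)
significant_adj = sum(p < alpha for p in adj)
print("adjusted:", [f"{p:.3f}" for p in adj])
print("significant before:", significant_raw, "after:", significant_adj)
```

With these made-up values, four of the five comparisons fall below 0.05 before adjustment and only two do afterwards. The effect grows with the number of hypotheses tested on the same data.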
Pages: 34
Related papers
50 in total
  • [1] Statistical evaluation of music information retrieval experiments
    Flexer, Arthur
    JOURNAL OF NEW MUSIC RESEARCH, 2006, 35 (02) : 113 - 120
  • [2] Statistical power and effect size in information retrieval experiments
    Nelson, MJ
    INFORMATION SCIENCE AT THE DAWN OF THE NEXT MILLENNIUM, 1998, : 393 - 400
  • [3] Statistical Significance Testing in Information Retrieval: Theory and Practice
    Carterette, Ben
    SIGIR'17: PROCEEDINGS OF THE 40TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2017, : 1387 - 1389
  • [4] Statistical Significance Testing in Information Retrieval: Theory and Practice
    Carterette, Ben
    SIGIR'14: PROCEEDINGS OF THE 37TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2014, : 1286 - 1286
  • [5] Statistical principal components analysis for retrieval experiments
    Dincer, Bekir Taner
    JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 2007, 58 (04) : 560 - 574
  • [6] Testing of facsimile and facsimile information retrieval systems
    Gaved, T.J.
    Farquharson, A.A.
    British Telecom technology journal, 1994, 12 (01) : 76 - 82
  • [7] A Mutual Information-based Framework for the Analysis of Information Retrieval Systems
    Golbus, Peter B.
    Aslam, Javed A.
    SIGIR'13: THE PROCEEDINGS OF THE 36TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH & DEVELOPMENT IN INFORMATION RETRIEVAL, 2013, : 683 - 692
  • [8] Mapping the glycome with systems-based analysis
    Mahal, Lara
    ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY, 2015, 249
  • [9] Estimating reliability of the retrieval systems effectiveness rank based on performance in multiple experiments
    Shuxiang Zhang
    Sri Devi Ravana
    Cluster Computing, 2017, 20 : 925 - 940
  • [10] Estimating reliability of the retrieval systems effectiveness rank based on performance in multiple experiments
    Zhang, Shuxiang
    Ravana, Sri Devi
    CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2017, 20 (01) : 925 - 940