Histograms as statistical estimators for aggregate queries

Cited by: 5
Authors
Chen, Lixia [1]
Dobra, Alin [1]
Affiliation
[1] Univ Florida, Dept Comp & Informat Sci & Engn, Gainesville, FL 32611 USA
Funding
National Science Foundation (USA)
Keywords
Histograms; Statistical analysis; Random shuffling assumption; XSKETCH SYNOPSES; ANSWER SIZES; XML;
DOI
10.1016/j.is.2012.08.003
Chinese Library Classification
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
The traditional statistical assumption for interpreting histograms, and for justifying approximate query processing methods based on them, is that all elements in a bucket have the same frequency; this is called the uniform distribution assumption. In this paper, we analyze histograms from a statistical point of view. We show that a significantly less restrictive statistical assumption - that the elements within a bucket are randomly arranged even though they might have different frequencies - leads to identical formulas for approximating aggregate queries using histograms. Under this assumption, we analyze the behavior of both unidimensional and multidimensional histograms and provide tight error guarantees for the quality of the approximations. We conclude that histograms are the best estimators if the assumption holds; sampling and sketching are significantly worse. As an example of how the statistical theory of histograms can be extended, we show how XSketches - an approximation technique for XML queries that uses histograms as building blocks - can be statistically analyzed. The combination of the random shuffling assumption and the other statistical assumptions associated with XSketch estimators ensures a complete statistical model and error analysis for XSketches. Published by Elsevier Ltd.
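The claim that random shuffling yields the same estimation formula as the uniform distribution assumption can be illustrated with a small sketch. This is not code from the paper: the bucket contents, the function names `uniform_estimate` and `expected_under_shuffling`, and the choice of a range-count query over the first few positions of a single bucket are all assumptions made for illustration. The sketch averages the true answer over every arrangement of a skewed bucket and compares it with the usual histogram estimate.

```python
from itertools import permutations
from statistics import mean


def uniform_estimate(bucket_total, bucket_width, query_width):
    """Classic histogram estimate: the bucket total scaled by the
    fraction of the bucket's positions that the query covers."""
    return bucket_total * query_width / bucket_width


def expected_under_shuffling(freqs, query_width):
    """Average true answer of a range-count query over the first
    `query_width` positions, taken over all arrangements of `freqs`.
    Exhaustive enumeration, so only suitable for tiny buckets."""
    return mean(sum(p[:query_width]) for p in permutations(freqs))


# Hypothetical bucket with strongly non-uniform frequencies.
freqs = [1, 1, 2, 3, 5, 8, 13, 21]
query_width = 3

estimate = uniform_estimate(sum(freqs), len(freqs), query_width)
expected = expected_under_shuffling(freqs, query_width)

print(f"histogram estimate:          {estimate}")
print(f"mean over all arrangements:  {expected}")
```

Although no element in this bucket has the average frequency, the mean over all arrangements coincides with the histogram estimate, which is the sense in which the weaker assumption leads to the same formula. The sketch says nothing about the error guarantees discussed in the abstract; it only checks the expectation.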
Pages: 213-230
Page count: 18