Histograms as statistical estimators for aggregate queries

Cited by: 5

Authors:

Chen, Lixia [1]

Dobra, Alin [1]

Affiliations:

[1] Univ Florida, Dept Comp & Informat Sci & Engn, Gainesville, FL 32611 USA

Source:

INFORMATION SYSTEMS | 2013 / Vol. 38 / Issue 02

Funding:

U.S. National Science Foundation;

Keywords:

Histograms; Statistical analysis; Random shuffling assumption; XSKETCH SYNOPSES; ANSWER SIZES; XML;

DOI:

10.1016/j.is.2012.08.003

Chinese Library Classification number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

The traditional statistical assumption for interpreting histograms, and for justifying approximate query processing methods based on them, is that all elements in a bucket have the same frequency; this is called the uniform distribution assumption. In this paper, we analyze histograms from a statistical point of view. We show that a significantly less restrictive statistical assumption (the elements within a bucket are randomly arranged, even though they might have different frequencies) leads to identical formulas for approximating aggregate queries using histograms. Under this assumption, we analyze the behavior of both unidimensional and multidimensional histograms and provide tight error guarantees for the quality of approximations. We conclude that histograms are the best estimators if the assumption holds; sampling and sketching are significantly worse. As an example of how the statistical theory of histograms can be extended, we show how XSketches, an approximation technique for XML queries that uses histograms as building blocks, can be statistically analyzed. The combination of the random shuffling assumption and the other statistical assumptions associated with XSketch estimators ensures a complete statistical model and error analysis for XSketches. Published by Elsevier Ltd.
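
The claim that random arrangement within a bucket yields the same estimation formula as uniform frequencies can be checked on a toy case. The following is a minimal sketch, not the paper's analysis: it assumes a single one-dimensional bucket and a range COUNT query covering the first `covered` positions of that bucket. The function names and the frequency values are hypothetical, chosen only for illustration.

```python
from fractions import Fraction
from itertools import permutations


def histogram_estimate(bucket_total, bucket_width, covered):
    """Usual per-bucket estimate: bucket total scaled by the covered fraction."""
    return Fraction(bucket_total * covered, bucket_width)


def expected_under_shuffling(freqs, covered):
    """Exact mean of the range sum over every arrangement of freqs in the bucket."""
    arrangements = list(permutations(freqs))
    total = sum(sum(arr[:covered]) for arr in arrangements)
    return Fraction(total, len(arrangements))


# Hypothetical, clearly non-uniform frequencies stored in one bucket.
freqs = [9, 1, 4, 0, 6]
covered = 2  # the query range overlaps 2 of the 5 positions

est = histogram_estimate(sum(freqs), len(freqs), covered)
exp = expected_under_shuffling(freqs, covered)
print(est, exp)  # prints "8 8"
```

The sketch only shows that the estimate matches the expected answer when all arrangements are equally likely; it says nothing about variance or the error guarantees mentioned in the abstract.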

Pages: 213-230

Number of pages: 18
