Improving estimation accuracy of aggregate queries on data cubes

Cited by: 4
Authors
Pourabbas, E. [1 ]
Shoshani, A. [2 ]
Affiliations
[1] Ist Anal Sistemi & Informat Antonio Ruberti, Italian Natl Res Council, I-00185 Rome, Italy
[2] Univ Calif Berkeley, Lawrence Berkeley Lab, Berkeley, CA 94720 USA
Keywords
Query estimation; Entropy; Accuracy analysis
DOI
10.1016/j.datak.2009.08.010
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we investigate the problem of estimating a target database from summary databases derived from a base data cube. We show that such estimates can be derived by choosing a primary database that has the desired target measure but not the desired dimensions, and using a proxy database to estimate the results. This technique is common in statistics, but an important issue we address is the accuracy of these estimates. Specifically, given multiple primary and multiple proxy databases, the problem is how to select the primary and proxy databases that will generate the most accurate target database estimation possible. We propose an algorithmic approach that uses the principles of information entropy to determine the steps for selecting or computing the primary and proxy databases that provide the most accurate target database. We show that the primary database with the largest number of cells in common with the target database and the proxy database provides the more accurate estimates. We prove that this is consistent with maximizing the entropy. We provide some experimental results on the accuracy of the target database estimation in order to verify our results. Furthermore, we investigate the accuracy results in cases where the dimensions are defined over a hierarchy of categories and roll-up and drill-down operations are needed to generate the desired target results. © 2009 Elsevier B.V. All rights reserved.
Pages: 50-72
Page count: 23
Related papers
50 in total
  • [21] Queries with aggregate functions over fuzzy RDF data
    Zongmin Ma
    Xiaowen Zhang
    Yuhan Zhao
    The Journal of Supercomputing, 2023, 79 : 14780 - 14807
  • [22] Performing Range Aggregate Queries in Stream Data Warehouse
    Gorawski, Marcin
    Malczok, Rafal
    MAN-MACHINE INTERACTIONS, 2009, 59 : 615 - 622
  • [23] Queries with aggregate functions over fuzzy RDF data
    Ma, Zongmin
    Zhang, Xiaowen
    Zhao, Yuhan
    JOURNAL OF SUPERCOMPUTING, 2023, 79 (13): : 14780 - 14807
  • [24] Estimation from aggregate data
    Gouno, E.
    Courtrai, L.
    Fredette, M.
    COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2011, 55 (01) : 615 - 626
  • [25] Improving accuracy for identifying related PubMed queries by an integrated approach
    Lu, Zhiyong
    Wilbur, W. John
    JOURNAL OF BIOMEDICAL INFORMATICS, 2009, 42 (05) : 831 - 838
  • [26] Efficient Range-Sum Queries along Dimensional Hierarchies in Data Cubes
    Lauer, Tobias
    Mai, Dominic
    Hagedorn, Philippe
    2009 FIRST INTERNATIONAL CONFERENCE ON ADVANCES IN DATABASES, KNOWLEDGE, AND DATA APPLICATIONS, 2009, : 7 - +
  • [27] Providing accurate answers to OLAP queries based on standardized moments of data cubes
    Pourabbas, Elaheh
    INFORMATION SYSTEMS, 2020, 94
  • [28] Partial-sum queries in OLAP data cubes using covering codes
    Ho, CT
    Bruck, J
    Agrawal, R
    IEEE TRANSACTIONS ON COMPUTERS, 1998, 47 (12) : 1326 - 1340
  • [29] A probabilistic framework for estimating the accuracy of aggregate range queries evaluated over histograms
    Buccafurri, Francesco
    Furfaro, Filippo
    Sacca, Domenico
    INFORMATION SCIENCES, 2012, 188 : 121 - 150
  • [30] Accuracy vs. Lifetime: Linear Sketches for Aggregate Queries in Sensor Networks
    Vasundhara Puttagunta
    Konstantinos Kalpakis
    Algorithmica, 2007, 49 : 357 - 385