When Good-Enough is Enough: Complex Queries at Fixed Cost

Cited by: 4
Authors
Mickulicz, Nathan D. [1 ]
Martins, Rolando [1 ]
Narasimhan, Priya [1 ]
Gandhi, Rajeev [1 ]
Affiliations
[1] Carnegie Mellon Univ, Dept Elect & Comp Engn, Pittsburgh, PA 15213 USA
DOI
10.1109/BigDataService.2015.24
Chinese Library Classification (CLC) number
TP31 [Computer Software];
Discipline classification code
081202; 0835;
Abstract
Collections of time-series data appear in a wide variety of contexts. To gain insight into the underlying phenomenon that the data represents, one must analyze the time-series data. Analysis can quickly become challenging for very large data-sets (on the order of terabytes or more), and it may be infeasible to scan the entire data-set on each query due to time limits or resource constraints. To avoid this problem, one might pre-compute partial results by scanning the data-set, usually as the data arrives. However, for complex queries, where the value of a new data record depends on all of the data previously seen, this might be infeasible because incorporating a large amount of historical data into a query requires a large amount of storage. We present an approach to performing complex queries over very large data-sets in a manner that is (i) practical, meaning that a query does not require a scan of the entire data-set, and (ii) fixed-cost, meaning that the amount of storage required depends only on the time-range spanned by the entire data-set (and not on the size of the data-set itself). We evaluate our approach with three different data-sets: (i) a 4-year commercial analytics data-set from a production content-delivery platform with over 15 million mobile users, (ii) an 18-year data-set from the Linux-kernel commit-history, and (iii) an 8-day data-set from Common Crawl HTTP logs. Our evaluation demonstrates the feasibility and practicality of our approach for a diverse set of complex queries on a diverse set of very large data-sets.
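The abstract does not describe the mechanism in detail, so the following is only a minimal illustrative sketch of the general idea of pre-computing partial results as data arrives, and it should not be read as the authors' method. It assumes a simple count-and-sum query and a hypothetical one-hour bucket width; the names `BucketedCounter`, `BUCKET_SECONDS`, `add`, and `query` are invented for illustration. Storage grows with the number of time buckets (the time-range covered), not with the number of records.

```python
from collections import defaultdict

BUCKET_SECONDS = 3600  # hypothetical bucket width (one hour); an assumption


class BucketedCounter:
    """Keeps one [count, total] pair per time bucket instead of raw records."""

    def __init__(self, bucket_seconds=BUCKET_SECONDS):
        self.bucket_seconds = bucket_seconds
        self.buckets = defaultdict(lambda: [0, 0.0])

    def add(self, timestamp, value):
        """Fold one arriving record into the partial result for its bucket."""
        key = int(timestamp // self.bucket_seconds)
        slot = self.buckets[key]
        slot[0] += 1
        slot[1] += value

    def query(self, start, end):
        """Return (count, total) over every bucket overlapping [start, end).

        Answers are exact only at bucket granularity; no raw data is scanned.
        """
        first = int(start // self.bucket_seconds)
        last = int(-(-end // self.bucket_seconds))  # ceiling division
        count, total = 0, 0.0
        for key in range(first, last):
            slot = self.buckets.get(key)  # .get avoids creating empty buckets
            if slot is not None:
                count += slot[0]
                total += slot[1]
        return count, total


if __name__ == "__main__":
    counter = BucketedCounter()
    for ts, value in [(10, 1.0), (20, 2.0), (3700, 5.0), (7300, 4.0)]:
        counter.add(ts, value)
    print(counter.query(0, 7200))
```

A simple sum like this does not capture the harder case the abstract highlights, where each record's contribution depends on all previously seen data; it only shows why per-bucket summaries keep storage tied to the time-range.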
Pages: 89-98
Page count: 10