Data Partitioning Method for Efficient Parallel Skyline Computation

Authors
Zhao X. [1, 3]
Shang H.-C. [2]
Affiliations
[1] Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology, Changsha
[2] Institute of Industrial Science, The University of Tokyo, Tokyo
[3] Collaborative Innovation Center of Geospatial Technology, Wuhan
Funding
National Natural Science Foundation of China
Keywords
Data partitioning; Parallel Skyline; Permutation model; Scalability
DOI
10.11897/SP.J.1016.2020.02050
Abstract
Skyline computation has long been an active research topic in data management. Given a set of multi-dimensional data points, the Skyline operator selects the points that are not dominated by any other point across all dimensions; evaluating this operator is referred to as Skyline computation. The operator lets users choose objects of interest from a comparatively small Skyline result set without having to consider the objects that have been filtered out. Consequently, Skyline computation has many applications, such as multi-criteria decision-making, visual data analytics, and user preference search; typical tasks include, but are not limited to, business marketing strategy analysis and lateral assessment of production capability. With the arrival of the big data era, the wide adoption of distributed networks, and the rapid development of cloud-platform-based solutions, growing data volume has become a key technical challenge in many application domains. Parallel Skyline operators for large-scale data sets were proposed to partially address this difficulty, and parallel Skyline computation has recently received increased attention from both academia and industry. Because global distributional information about the overall data set is lacking, parallel Skyline computation faces significant technical challenges. A parallel Skyline processing framework typically comprises three major steps: (1) partition the possibly large data set appropriately; (2) use local computing resources to evaluate a local Skyline on each partition; and (3) merge the local Skylines into a global Skyline.
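The three-step framework described above can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the function names (`dominates`, `skyline`, `round_robin_partition`, `parallel_skyline`) are hypothetical, smaller values are assumed to be preferred on every dimension, the local Skyline routine is a naive quadratic scan, and the round-robin partitioner is only a placeholder for step (1).

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    """Assumed convention: p dominates q if p is no worse (<=) on every
    dimension and strictly better (<) on at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points: Sequence[Point]) -> List[Point]:
    """Naive O(n^2) Skyline: keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def round_robin_partition(points: Sequence[Point], k: int) -> List[List[Point]]:
    """Step 1 (placeholder partitioner): deal the points into k partitions."""
    return [list(points[i::k]) for i in range(k)]


def parallel_skyline(points: Sequence[Point], k: int) -> List[Point]:
    partitions = round_robin_partition(points, k)            # step 1: partition
    # Step 2: local Skylines. Evaluated sequentially here; in a real system
    # each partition would be handed to a separate worker.
    local_skylines = [skyline(part) for part in partitions]
    # Step 3: merge the local Skylines and filter once more.
    candidates = [p for local in local_skylines for p in local]
    return skyline(candidates)


data = [(1.0, 9.0), (2.0, 7.0), (3.0, 8.0), (4.0, 4.0),
        (6.0, 5.0), (7.0, 2.0), (8.0, 3.0), (9.0, 1.0)]
result = sorted(parallel_skyline(data, k=3))
print(result)
```

Because every global Skyline point is also undominated within its own partition, merging the local Skylines and filtering again yields the same result as a single pass over the whole data set. The cost of step (3) grows with the number of candidates that survive step (2), which is why the choice of partitioner matters.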
Of the three steps of parallel Skyline processing (partitioning the data, computing local Skylines, and merging them), the latter two are served by many existing algorithms, and the related research is comparatively mature. Research on the first step, by contrast, is comparatively rare, even though its quality directly determines the parallelism of the entire computation and hence the overall performance of the parallel computing system. Two criteria must be considered for the first step: (1) whether computation workloads are balanced across partitions; and (2) how to reduce the local Skyline cardinality on each partition. However, existing parallel Skyline algorithms, whether based on random partitioning or on grid-based methods, satisfy only one of these criteria, not both. To address this issue, we use a probabilistic model to estimate Skyline cardinality; the model encapsulates relevant results from the existing literature in a unified framework. Based on this model, we propose a novel permutation-based data partitioning method which, through a simple mapping of data points, achieves load balance and generates smaller Skyline candidate sets than existing methods. Building on this theoretical study, extensive experiments on large-scale synthetic and real-life data sets verify the effectiveness of the proposed model and method: the proposed method improves the execution efficiency of the parallel Skyline operator and outperforms comparable existing algorithms under most parameter settings. © 2020, Science Press. All rights reserved.
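The two partitioning criteria can be made concrete with a small measurement harness. The sketch below is an illustration under stated assumptions, not the paper's experimental setup: it generates uniform random 2-D points, applies a random partitioner and an equal-width grid partitioner (both hypothetical helper functions written for this example), and reports partition sizes (criterion 1) and the total number of local Skyline candidates (criterion 2). It does not implement the permutation-based method, whose details are not given in the abstract.

```python
import random
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    # Assumed convention: smaller is better on every dimension.
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points: Sequence[Point]) -> List[Point]:
    # Naive O(n^2) Skyline, adequate for a small illustration.
    return [p for p in points if not any(dominates(q, p) for q in points)]


def random_partition(points: Sequence[Point], k: int, seed: int = 0) -> List[List[Point]]:
    """Shuffle the points, then deal them into k partitions."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def grid_partition(points: Sequence[Point], cells_per_dim: int) -> List[List[Point]]:
    """Assign points in [0, 1)^d to equal-width grid cells (non-empty cells only)."""
    cells: Dict[Tuple[int, ...], List[Point]] = {}
    for p in points:
        key = tuple(min(int(x * cells_per_dim), cells_per_dim - 1) for x in p)
        cells.setdefault(key, []).append(p)
    return list(cells.values())


def report(partitions: List[List[Point]]) -> Dict[str, int]:
    sizes = [len(part) for part in partitions]
    return {
        "max_size": max(sizes),   # criterion 1: load balance
        "min_size": min(sizes),
        # criterion 2: total local Skyline points passed to the merge step
        "candidates": sum(len(skyline(part)) for part in partitions),
    }


rng = random.Random(42)
data = [(rng.random(), rng.random()) for _ in range(1000)]

rand_parts = random_partition(data, k=4)
grid_parts = grid_partition(data, cells_per_dim=2)
rand_report = report(rand_parts)
grid_report = report(grid_parts)
print("random:", rand_report)
print("grid:  ", grid_report)
```

Comparing `max_size` against `min_size` shows how evenly the work is spread, while `candidates` indicates how much data the merge step must process. The same harness could be pointed at any other partitioner, including one built on a different data distribution such as anti-correlated points.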
Pages: 2050-2066
Number of pages: 16