Cost-efficient Data Acquisition on Online Data Marketplaces for Correlation Analysis

Cited by: 6
Authors
Li, Yanying [1]
Sun, Haipei [1]
Dong, Boxiang [2]
Wang, Hui [1]
Affiliations
[1] Stevens Inst Technol, 1 Castle Point Terrace, Hoboken, NJ 07030 USA
[2] Montclair State Univ, 1 Normal Ave, Montclair, NJ 07043 USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2018 / Vol. 12 / Issue 04
Keywords
DOI
10.14778/3297753.3297757
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Driven by enormous economic profits, data marketplace platforms have proliferated recently. In this paper, we consider the data marketplace setting where a data shopper would like to buy data instances from the data marketplace for correlation analysis of certain attributes. We assume that the data in the marketplace is dirty and not free. The goal is to find the data instances, from a large number of datasets in the marketplace, whose join result is not only of high quality and rich in join informativeness, but also delivers the best correlation between the requested attributes. To achieve this goal, we design DANCE, a middleware that provides the desired data acquisition service. DANCE consists of two phases: (1) in the offline phase, it constructs a two-layer join graph from samples, which captures information about the datasets in the marketplace at both the schema and instance levels; (2) in the online phase, it searches for the data instances that satisfy the constraints on data quality, budget, and join informativeness, while maximizing the correlation between the source and target attribute sets. We prove that the search problem is NP-hard, and design a heuristic algorithm based on Markov chain Monte Carlo (MCMC). Experimental results on two benchmark datasets and one real dataset demonstrate the efficiency and effectiveness of our heuristic data acquisition algorithm.
Pages: 362-375
Page count: 14