Cutting the Unnecessary Long Tail: Cost-Effective Big Data Clustering in the Cloud

Cited by: 4
Authors
Li, Dongwei [1, 2]
Wang, Shuliang [1]
Gao, Nan [3]
He, Qiang [2]
Yang, Yun [2]
Affiliations
[1] Beijing Inst Technol, Sch Comp Sci, Beijing 100811, Haidian, Peoples R China
[2] Swinburne Univ Technol, Sch Software & Elect Engn, Hawthorn, Vic 3122, Australia
[3] RMIT Univ, Sch Sci, Melbourne, Vic 3000, Australia
Keywords
Cloud computing; cost-effectiveness; clustering algorithms; big data; data mining; ALGORITHMS; EM;
DOI
10.1109/TCC.2019.2947678
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Clustering big data often requires tremendous computational resources, for which cloud computing is one of the most promising solutions. However, the computation cost in the cloud can be unexpectedly high if it is not managed properly. The long tail phenomenon has been widely observed in big data clustering: the majority of the running time is often consumed in the middle to late stages of the clustering process. In this research, we cut the unnecessary long tail of the clustering process to achieve a sufficiently satisfactory accuracy at the lowest possible computation cost. We propose a novel approach to cost-effective big data clustering in the cloud. By training a regression model on sampled data, we make the widely used k-means and EM (Expectation-Maximization) algorithms stop automatically at an early point, once the desired accuracy is obtained. Experiments on four popular data sets demonstrate that both k-means and EM can achieve high cost-effectiveness in the cloud with the proposed approach. For example, in the case studies with the much more efficient k-means algorithm, achieving 99 percent accuracy needs only 47.71-71.14 percent of the computation cost required for achieving 100 percent accuracy, while the less efficient EM algorithm needs only 16.69-32.04 percent of that cost. To put this in perspective, in the United States land use classification example, our approach can save the government up to $94,687.49 per use.
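The early-stopping idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which features or regression model the paper uses, so everything below is an assumption made for illustration. In particular, "accuracy" is approximated by the ratio of the converged SSE to the current SSE, the only regression feature is the log of the relative SSE change per iteration, the regression is a straight-line fit, and the names `kmeans_trace`, `stop`, and `reached_target` are hypothetical.

```python
import numpy as np


def assign(X, centers):
    """Return each point's nearest-center label and the total squared distance (SSE)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    return labels, d2[np.arange(len(X)), labels].sum()


def update(X, labels, centers):
    """Move each center to the mean of its points; empty clusters stay in place."""
    new = centers.copy()
    for j in range(len(centers)):
        members = X[labels == j]
        if len(members):
            new[j] = members.mean(axis=0)
    return new


def kmeans_trace(X, centers, max_iter=100, stop=None):
    """Run Lloyd's k-means and record the SSE after every assignment step.

    `stop(rel_change)` is an optional, hypothetical callback; returning True
    ends the run before full convergence.
    """
    sse_hist = []
    for _ in range(max_iter):
        labels, sse = assign(X, centers)
        sse_hist.append(sse)
        if len(sse_hist) > 1:
            rel = (sse_hist[-2] - sse) / max(sse_hist[-2], 1e-12)
            if rel <= 1e-9 or (stop is not None and stop(rel)):
                break
        centers = update(X, labels, centers)
    return centers, np.array(sse_hist)


# Synthetic data (assumption): three overlapping Gaussian blobs.
rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = np.vstack([rng.normal(m, 1.0, size=(2000, 2)) for m in means])
init = X[rng.choice(len(X), size=3, replace=False)]

# Step 1: run to convergence on a small sample and fit
#         accuracy proxy ~ slope * log10(relative SSE change) + intercept.
sample = X[rng.choice(len(X), size=600, replace=False)]
_, s = kmeans_trace(sample, init)
rel = (s[:-1] - s[1:]) / s[:-1]
acc = s[-1] / s[1:]  # assumed proxy: converged SSE / current SSE
slope, intercept = np.polyfit(np.log10(np.maximum(rel, 1e-12)), acc, 1)

# Step 2: on the full data, stop once the predicted accuracy reaches the target.
target = 0.99


def reached_target(rel_change):
    predicted = slope * np.log10(max(rel_change, 1e-12)) + intercept
    return predicted >= target


_, full = kmeans_trace(X, init)
_, early = kmeans_trace(X, init, stop=reached_target)
print("iterations (early / full):", len(early), "/", len(full))
print("accuracy proxy at early stop:", full[-1] / early[-1])
```

Because both runs start from the same centers, the early-stopped run is a prefix of the full run, so the iterations it skips are exactly the "long tail" whose cost is saved. How well the sample-trained line predicts accuracy on the full data depends on the data set and is not guaranteed by this sketch.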
Pages: 292 - 303
Page count: 12