Cutting the Unnecessary Long Tail: Cost-Effective Big Data Clustering in the Cloud

Cited by: 4
Authors
Li, Dongwei [1, 2]
Wang, Shuliang [1]
Gao, Nan [3]
He, Qiang [2]
Yang, Yun [2]
Affiliations
[1] Beijing Inst Technol, Sch Comp Sci, Beijing 100811, Haidian, Peoples R China
[2] Swinburne Univ Technol, Sch Software & Elect Engn, Hawthorn, Vic 3122, Australia
[3] RMIT Univ, Sch Sci, Melbourne, Vic 3000, Australia
Keywords
Cloud computing; cost-effectiveness; clustering algorithms; big data; data mining; ALGORITHMS; EM;
DOI
10.1109/TCC.2019.2947678
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Clustering big data often requires tremendous computational resources, and cloud computing is one of the promising solutions. However, the computation cost in the cloud can be unexpectedly high if it is not managed properly. The long tail phenomenon has been widely observed in big data clustering: the majority of the time is often consumed in the middle to late stages of the clustering process. In this research, we cut the unnecessary long tail of the clustering process to achieve a sufficiently satisfactory accuracy at the lowest possible computation cost. A novel approach is proposed to achieve cost-effective big data clustering in the cloud. By training a regression model on sampled data, we make the widely used k-means and EM (Expectation-Maximization) algorithms stop automatically at an early point once the desired accuracy is obtained. Experiments on four popular data sets demonstrate that both k-means and EM can achieve high cost-effectiveness in the cloud with the proposed approach. For example, in the case studies with the much more efficient k-means algorithm, achieving 99 percent accuracy needs only 47.71-71.14 percent of the computation cost required for 100 percent accuracy, while the less efficient EM algorithm needs only 16.69-32.04 percent. To put that into perspective, in the United States land use classification example, our approach can save up to $94,687.49 for the government in each use.
Pages: 292-303
Number of pages: 12
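
The early-stopping idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' method: the paper trains a regression model on sampled data to decide when the desired accuracy has been reached, whereas the sketch below substitutes a fixed threshold on the relative improvement of the k-means objective. The function names and the `min_rel_gain` parameter are hypothetical, chosen only for illustration.

```python
import random


def assign(points, centers):
    """Return the index of the nearest center for every point."""
    return [
        min(range(len(centers)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
        for p in points
    ]


def update(points, labels, centers):
    """Recompute each center as the mean of its members (empty ones stay put)."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, l in zip(points, labels) if l == j]
        if members:
            new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centers.append(old)
    return new_centers


def objective(points, labels, centers):
    """Sum of squared distances from each point to its assigned center."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centers[l]))
        for p, l in zip(points, labels)
    )


def kmeans(points, init_centers, min_rel_gain=0.0, max_iter=100):
    """Lloyd's k-means that stops once the relative drop in the objective
    is at most min_rel_gain (0.0 means: run until no further improvement).
    ASSUMPTION: this threshold rule stands in for the paper's regression model."""
    centers = list(init_centers)
    prev = None
    for it in range(1, max_iter + 1):
        labels = assign(points, centers)
        centers = update(points, labels, centers)
        cur = objective(points, labels, centers)
        if prev is not None and prev - cur <= min_rel_gain * prev:
            break
        prev = cur
    return centers, cur, it


# Synthetic data: three Gaussian blobs of 100 points each.
random.seed(0)
points = [(random.gauss(cx, 1.0), random.gauss(cy, 1.0))
          for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(100)]
init = random.sample(points, 3)

_, full_obj, full_iters = kmeans(points, init)                        # full run
_, early_obj, early_iters = kmeans(points, init, min_rel_gain=0.01)   # tail cut
print(full_iters, early_iters, early_obj / full_obj)
```

Both runs start from the same initial centers and follow the same trajectory, so the early run can only stop at or before the iteration where the full run stops. The printed ratio shows how much objective quality is given up for the iterations saved. In a cloud setting, the saved iterations translate into saved compute cost, which is the trade-off the abstract quantifies.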