Big data clustering techniques based on Spark: a literature review

Authors
Saeed M.M. [1]
Aghbari Z.A. [2]
Alsharidah M. [1]
Affiliations
[1] Department of Computer Science, Prince Sattam Bin Abdul Aziz, Riyadh
[2] Department of Computer Science, University of Sharjah, Sharjah
Keywords
Big Data; Big Data clustering; Spark; Spark-based clustering
DOI
10.7717/PEERJ-CS.321
Abstract
Clustering, a popular unsupervised learning method, is extensively used in data mining, machine learning and pattern recognition. It groups data points into clusters such that points in the same cluster are similar to each other and dissimilar to points in other clusters. Traditional clustering methods are greatly challenged by the recent massive growth of data. Therefore, several research works have proposed novel designs for clustering methods that leverage the benefits of Big Data platforms, such as Apache Spark, which is designed for fast and distributed massive data processing. However, Spark-based clustering research is still in its early days. In this systematic survey, we investigate the existing Spark-based clustering methods in terms of their support for the characteristics of Big Data. Moreover, we propose a new taxonomy for the Spark-based clustering methods. To the best of our knowledge, no survey has been conducted on Spark-based clustering of Big Data. Therefore, this survey aims to present a comprehensive summary of the previous studies in the field of Big Data clustering using Apache Spark during the span of 2010-2020. This survey also highlights the new research directions in the field of clustering massive data. © Copyright 2020 Saeed et al.
Pages: 1 / 28
Number of pages: 27
Related papers
50 entries in total
  • [1] Big data clustering techniques based on Spark: a literature review
    Saeed, Mozamel M.
    Al Aghbari, Zaher
    Alsharidah, Mohammed
    PEERJ COMPUTER SCIENCE, 2020
  • [2] Apache Spark Methods and Techniques in Big Data-A Review
    Sahana, H. P.
    Sanjana, M. S.
    Muddasir, N. Mohammed
    Vidyashree, K. P.
    INVENTIVE COMMUNICATION AND COMPUTATIONAL TECHNOLOGIES, ICICCT 2019, 2020, 89 : 721 - 726
  • [3] Big Data Clustering Techniques Challenges and Perspectives: Review
    Awad F.H.
    Hamad M.M.
    Informatica (Slovenia), 2023, 47 (06) : 203 - 218
  • [5] An Efficient Parallel Algorithm for Clustering Big Data based on the Spark Framework
    Dafir, Zineb
    Slaoui, Said
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2022, 13 (07) : 890 - 896
  • [6] Literature Review on High Dimensional Data Clustering Techniques
    Selvavinayagam, G.
    Loganathan, Venkateshwaran
    Loheswaran, K.
    BIOSCIENCE BIOTECHNOLOGY RESEARCH COMMUNICATIONS, 2020, 13 (06) : 183 - 187
  • [7] A Framework for Clustering and Classification of Big Data Using Spark
    Mallios, Xristos
    Vassalos, Vasilis
    Venetis, Tassos
    Vlachou, Akrivi
    ON THE MOVE TO MEANINGFUL INTERNET SYSTEMS: OTM 2016 CONFERENCES, 2016, 10033 : 344 - 362
  • [8] Design of Intelligent K-Means Based on Spark for Big Data Clustering
    Kusuma, Ilham
    Ma'sum, M. Anwar
    Habibie, Novian
    Jatmiko, Wisnu
    Suhartanto, Heru
    2016 INTERNATIONAL WORKSHOP ON BIG DATA AND INFORMATION SECURITY (IWBIS), 2016 : 89 - 95
  • [9] Fuzzy Based Clustering Algorithms to Handle Big Data with Implementation on Apache Spark
    Bharill, Neha
    Tiwari, Aruna
    Malviya, Aayushi
    PROCEEDINGS 2016 IEEE SECOND INTERNATIONAL CONFERENCE ON BIG DATA COMPUTING SERVICE AND APPLICATIONS (BIGDATASERVICE 2016), 2016 : 95 - 104
  • [10] Literature review and analysis on big data stream classification techniques
    Srivani, B.
    Sandhya, N.
    Rani, B. Padmaja
    INTERNATIONAL JOURNAL OF KNOWLEDGE-BASED AND INTELLIGENT ENGINEERING SYSTEMS, 2020, 24 (03) : 205 - 215