Frequent Itemsets Mining for Big Data: A Comparative Analysis

Cited by: 29
Authors
Apiletti, Daniele [1]
Baralis, Elena [1]
Cerquitelli, Tania [1]
Garza, Paolo [1]
Pulvirenti, Fabio [1]
Venturini, Luca [1]
Affiliations
[1] Politecnico di Torino, Dipartimento di Automatica e Informatica, Turin, Italy
Keywords
Big Data; Frequent itemset mining; Hadoop and Spark platforms
DOI
10.1016/j.bdr.2017.06.006
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Itemset mining is a well-known exploratory data mining technique used to discover interesting correlations hidden in a data collection. Since it supports different targeted analyses, it is profitably exploited in a wide range of domains, from network traffic data to medical records. With the increasing amount of generated data, different scalable algorithms have been developed, exploiting the advantages of distributed computing frameworks such as Apache Hadoop and Spark. This paper reviews Hadoop- and Spark-based scalable algorithms addressing the frequent itemset mining problem in the Big Data domain through both theoretical and experimental comparative analyses. Since the itemset mining task is computationally expensive, its distribution and parallelization strategies heavily affect memory usage, load balancing, and communication costs. A detailed discussion of the algorithmic choices of the distributed methods for frequent itemset mining is followed by an experimental analysis comparing the performance of state-of-the-art distributed implementations on both synthetic and real datasets. The strengths and weaknesses of the algorithms are thoroughly discussed with respect to the dataset features (e.g., data distribution, average transaction length, number of records) and specific parameter settings. Finally, based on theoretical and experimental analyses, open research directions for the parallelization of the itemset mining problem are presented. © 2017 Elsevier Inc. All rights reserved.
Pages: 67-83
Page count: 17
Related papers
50 in total
  • [31] An Algorithm for Mining Frequent Closed Itemsets in Data Stream
    Dai, Caiyan
    Chen, Ling
    2010 INTERNATIONAL COLLOQUIUM ON COMPUTING, COMMUNICATION, CONTROL, AND MANAGEMENT (CCCM2010), VOL I, 2010, : 281 - 284
  • [32] Mining frequent itemsets in a stream
    Calders, Toon
    Dexters, Nele
    Goethals, Bart
    ICDM 2007: PROCEEDINGS OF THE SEVENTH IEEE INTERNATIONAL CONFERENCE ON DATA MINING, 2007, : 83 - +
  • [33] An Algorithm for Mining Frequent Itemsets
    Hernandez Leon, Raudel
    Perez Suarez, Airel
    Feregrino Uribe, Claudia
    Guzman Zavaleta, Zobeida Jezabel
    2008 5TH INTERNATIONAL CONFERENCE ON ELECTRICAL ENGINEERING, COMPUTING SCIENCE AND AUTOMATIC CONTROL (CCE 2008), 2008, : 236 - +
  • [34] Mining frequent itemsets in a stream
    Calders, Toon
    Dexters, Nele
    Gillis, Joris J. M.
    Goethals, Bart
    INFORMATION SYSTEMS, 2014, 39 : 233 - 255
  • [35] Revenue prediction by mining frequent itemsets with customer analysis
    Weng, Cheng-Hsiung
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2017, 63 : 85 - 97
  • [36] A survey on algorithms for mining frequent itemsets over data streams
    Cheng, James
    Ke, Yiping
    Ng, Wilfred
    KNOWLEDGE AND INFORMATION SYSTEMS, 2008, 16 (01) : 1 - 27
  • [37] Towards a new approach for mining frequent itemsets on data stream
    Raissi, Chedy
    Poncelet, Pascal
    Teisseire, Maguelonne
    JOURNAL OF INTELLIGENT INFORMATION SYSTEMS, 2007, 28 (01) : 23 - 36
  • [38] A Novel Strategy for Mining Frequent Closed Itemsets in Data Streams
    Tang, Keming
    Dai, Caiyan
    Chen, Ling
    JOURNAL OF COMPUTERS, 2012, 7 (07) : 1564 - 1573
  • [39] A sliding window algorithm for mining frequent itemsets on data stream
    Liu, Junqiang
    Li, Xiurong
    DCABES 2006 PROCEEDINGS, VOLS 1 AND 2, 2006, : 637 - 639
  • [40] A decremental approach for mining frequent itemsets from uncertain data
    Chui, Chun-Kit
    Kao, Ben
    ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PROCEEDINGS, 2008, 5012 : 64 - 75