Frequent Itemsets Mining for Big Data: A Comparative Analysis

Cited by: 29

Authors:

Apiletti, Daniele [1]

Baralis, Elena [1]

Cerquitelli, Tania [1]

Garza, Paolo [1]

Pulvirenti, Fabio [1]

Venturini, Luca [1]

Affiliations:

[1] Politecnico di Torino, Dipartimento di Automatica e Informatica, Turin, Italy

Source:

Big Data Research | 2017 / Vol. 9

Keywords:

Big Data; Frequent itemset mining; Hadoop and Spark platforms

DOI:

10.1016/j.bdr.2017.06.006

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Itemset mining is a well-known exploratory data mining technique used to discover interesting correlations hidden in a data collection. Since it supports different targeted analyses, it is profitably exploited in a wide range of domains, from network traffic data to medical records. With the increasing amount of generated data, different scalable algorithms have been developed, exploiting the advantages of distributed computing frameworks, such as Apache Hadoop and Spark. This paper reviews Hadoop- and Spark-based scalable algorithms addressing the frequent itemset mining problem in the Big Data domain through both theoretical and experimental comparative analyses. Since the itemset mining task is computationally expensive, its distribution and parallelization strategies heavily affect memory usage, load balancing, and communication costs. A detailed discussion of the algorithmic choices of the distributed methods for frequent itemset mining is followed by an experimental analysis comparing the performance of state-of-the-art distributed implementations on both synthetic and real datasets. The strengths and weaknesses of the algorithms are thoroughly discussed with respect to the dataset features (e.g., data distribution, average transaction length, number of records) and specific parameter settings. Finally, based on theoretical and experimental analyses, open research directions for the parallelization of the itemset mining problem are presented. © 2017 Elsevier Inc. All rights reserved.
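
For readers unfamiliar with the task named in the abstract, the following is a minimal single-machine sketch of frequent itemset mining using a level-wise (Apriori-style) search. It is not one of the Hadoop- or Spark-based algorithms the paper reviews; the function name `frequent_itemsets`, the `min_support` parameter (an absolute count), and the toy transactions are assumptions made purely for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for every itemset that appears
    in at least `min_support` transactions (absolute count)."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count how many transactions contain each single item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(current)

    k = 2
    while current:
        # Candidate generation: join frequent (k-1)-itemsets and keep only
        # size-k unions whose (k-1)-subsets are all frequent (Apriori pruning).
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)

        # Support counting: one pass over the data per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        result.update(current)
        k += 1

    return result


if __name__ == "__main__":
    # Hypothetical toy dataset: each transaction is a set of items.
    toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    for itemset, support in sorted(
        frequent_itemsets(toy, 3).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))
    ):
        print(sorted(itemset), support)
```

The support-counting pass touches every transaction once per level, which hints at why the abstract stresses that distribution and parallelization strategies affect memory usage, load balancing, and communication costs when the data no longer fits on one machine.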

Pages: 67-83

Number of pages: 17

