A scalable and flexible basket analysis system for big transaction data in Spark

Cited by: 5
Authors
Sun, Xudong [1, 2]
Ngueilbaye, Alladoumbaye [1, 2]
Luo, Kaijing [1, 2]
Cai, Yongda [1, 2]
Wu, Dingming [1, 2]
Huang, Joshua Zhexue [1, 2, 3]
Affiliations
[1] Shenzhen Univ, Natl Engn Lab Big Data Syst Comp Technol, Shenzhen 518060, Peoples R China
[2] Shenzhen Univ, Big Data Inst, Coll Comp Sci & Software Engn, Shenzhen 518060, Peoples R China
[3] Guangdong Lab Artificial Intelligence & Digital Ec, Shenzhen 518107, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Big transaction data; Frequent itemset mining; Parallel and distributed computing; Business basket analysis; Basket analysis systems; FP-GROWTH; FREQUENT; ALGORITHM; PATTERNS;
DOI
10.1016/j.ipm.2023.103577
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Basket analysis is a widely used technique that helps retailers uncover patterns and associations among products sold in customer shopping transactions. However, as transaction databases grow, traditional basket analysis techniques and systems become less effective because of two issues in big data applications: data scalability and the flexibility to adapt to different application tasks. This paper proposes a scalable distributed frequent itemset mining (ScaDistFIM) algorithm for basket analysis on big transaction data to address these two problems. ScaDistFIM runs in two stages. The first stage uses the FP-Growth algorithm to compute the local frequent itemsets of each random subset of the distributed transaction dataset, with all random subsets processed in parallel. The second stage uses an approximation method to aggregate all local frequent itemsets into the final approximate set of frequent itemsets, in which the support values of the frequent itemsets are estimated. We further describe the implementation of the ScaDistFIM algorithm and of a flexible basket analysis system that uses Spark SQL queries, to demonstrate the system's flexibility in real applications. Experimental results on synthetic and real-world transaction datasets show that, compared to the Spark FP-Growth algorithm, ScaDistFIM achieves time savings of at least 90% while maintaining nearly 100% accuracy, and therefore scales better. On the GenD dataset with 1 billion records, ScaDistFIM requires only 360 s to achieve 100% precision and recall, whereas Spark FP-Growth cannot complete the computation because of memory limitations.
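The two-stage structure described in the abstract can be illustrated with a small single-machine sketch in plain Python. This is not the authors' implementation. Several parts are assumptions made only for illustration: brute-force itemset counting stands in for FP-Growth, the subsets are processed in a loop rather than in parallel on Spark, and the aggregation rule (keep an itemset if it is locally frequent in at least a given fraction of subsets, and estimate its support as the mean of its local supports) is a guess, since the abstract does not specify the approximation method. All function names and parameters (`split_random`, `local_frequent_itemsets`, `aggregate`, `max_len`, `min_vote`) are hypothetical.

```python
import random
from collections import Counter, defaultdict
from itertools import combinations


def split_random(transactions, num_subsets, seed=0):
    """Shuffle the transactions and deal them into num_subsets random subsets."""
    shuffled = list(transactions)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::num_subsets] for i in range(num_subsets)]


def local_frequent_itemsets(transactions, min_support, max_len=3):
    """Stage 1 (per subset): return {itemset: relative support}.

    Brute-force enumeration is used here as a simple stand-in for FP-Growth.
    """
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, min(max_len, len(items)) + 1):
            for combo in combinations(items, k):
                counts[frozenset(combo)] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


def aggregate(local_results, min_vote=0.5):
    """Stage 2: combine local results into an approximate global result.

    Assumed rule: keep itemsets that are locally frequent in at least
    min_vote of the subsets; estimate support as the mean local support.
    """
    supports = defaultdict(list)
    for result in local_results:
        for itemset, support in result.items():
            supports[itemset].append(support)
    needed = min_vote * len(local_results)
    return {
        itemset: sum(values) / len(values)
        for itemset, values in supports.items()
        if len(values) >= needed
    }


# Toy data: "bread" is in every basket, "caviar" in exactly one.
transactions = (
    [["bread", "milk"]] * 60
    + [["bread", "butter"]] * 39
    + [["bread", "caviar"]]
)
subsets = split_random(transactions, num_subsets=4)
local_results = [local_frequent_itemsets(s, min_support=0.3) for s in subsets]
approx = aggregate(local_results)

for itemset, support in sorted(approx.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), round(support, 3))
```

In this sketch each subset is mined independently, which is the property that would allow the first stage to run in parallel on a cluster; only the small per-subset result dictionaries need to be brought together in the second stage.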
Pages: 22