Effective data management strategy and RDD weight cache replacement strategy in Spark

Cited by: 3
Authors
Jiang, Kun [1, 3]
Du, Shaofeng [2]
Zhao, Fu [2]
Huang, Yong [4]
Li, Chunlin [1, 3]
Luo, Youlong [3]
Affiliations
[1] China Inst Water Resources & Hydropower Res, Key Lab Construction & Safety Water Engn, Minist Water Resources, Beijing, Peoples R China
[2] State Key Lab Smart Mfg Special Vehicles & Transmi, Baotou, Peoples R China
[3] Wuhan Univ Technol, Sch Comp Sci & Artificial Intelligence, Wuhan 430063, Peoples R China
[4] Chongqing Univ, Key Lab New Technol Construction Cities Mt Area, Minist Educ, Chongqing 400045, Peoples R China
Keywords
Data shuffling; Data management; Cache gain; RDD partition weights; Adaptive cache replacement; DATA PLACEMENT; ALGORITHM;
DOI
10.1016/j.comcom.2022.07.008
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
With the dramatic increase in internet users and their demand for real-time network performance, the Spark distributed computing environment has emerged. It is widely used due to its high-performance caching mechanism and high scalability. Because data access patterns in current big data environments are unpredictable, the data shuffling phase is prone to under-utilization of Spark cluster resources, high computational latency, and high task processing latency. To address this, this paper proposes an intermediate data management strategy for the data shuffling phase. First, the size of the data generated in the data shuffling phase of the Spark platform is predicted by random sampling. Second, a strength division strategy measures the degree of data skew and identifies the portion of the data whose skew deviation is excessive. Finally, an adaptive data management strategy assigns the corresponding computation tasks according to the data deviation. In addition, to improve the response time, memory usage, and computation latency of Spark applications, an adaptive cache replacement algorithm based on RDD partition weights is proposed. It computes an RDD partition weight value that takes into account four weight factors: computation cost, usage times, partition size, and life cycle of RDDs. Compared with current mainstream baseline algorithms, the proposed data management algorithm for the data shuffling phase effectively reduces resource usage and computational response latency, and the proposed adaptive cache replacement algorithm based on RDD partition weights makes fuller use of memory resources and reduces resource wastage.
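The abstract names the four weight factors but not the formula that combines them. As a rough illustration only, the Python sketch below assumes a hypothetical weight of the form compute cost x usage count x life cycle / size, and evicts the lowest-weight partition first when space runs out. The names `Partition`, `weight`, and `WeightedCache`, the field units, and the formula itself are all assumptions made for this example, not the paper's definitions and not part of Spark's API.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    """A cached RDD partition described by the four factors from the abstract.

    Field names and units are hypothetical, chosen for illustration.
    """
    name: str
    compute_cost: float  # assumed: seconds needed to recompute the partition
    use_count: int       # assumed: how many times the partition is used
    size_mb: float       # assumed: partition size in MB
    life_cycle: float    # assumed: how long the partition remains useful


def weight(p: Partition) -> float:
    """Hypothetical weight: expensive, frequently used, long-lived and small
    partitions score higher. This is NOT the formula from the paper."""
    return p.compute_cost * p.use_count * p.life_cycle / p.size_mb


class WeightedCache:
    """Fixed-capacity cache that evicts the lowest-weight partition first."""

    def __init__(self, capacity_mb: float):
        self.capacity_mb = capacity_mb
        self.entries = {}  # partition name -> Partition

    def used_mb(self) -> float:
        return sum(p.size_mb for p in self.entries.values())

    def put(self, part: Partition) -> list:
        """Cache `part`, returning the names of any evicted partitions."""
        evicted = []
        if part.size_mb > self.capacity_mb:
            return evicted  # larger than the whole cache, so it is not cached
        while self.used_mb() + part.size_mb > self.capacity_mb:
            victim = min(self.entries.values(), key=weight)
            del self.entries[victim.name]
            evicted.append(victim.name)
        self.entries[part.name] = part
        return evicted


cache = WeightedCache(capacity_mb=100)
cache.put(Partition("a", compute_cost=10, use_count=3, size_mb=40, life_cycle=2))
cache.put(Partition("b", compute_cost=1, use_count=1, size_mb=50, life_cycle=1))
# Inserting "c" needs 30 MB but only 10 MB is free, so the lowest-weight
# partition ("b") is evicted.
evicted = cache.put(Partition("c", compute_cost=5, use_count=2, size_mb=30, life_cycle=1))
```

In this toy run the large, cheap, rarely used partition is evicted before the costly, frequently used one, which is the kind of behavior a weight-based policy is meant to produce and a pure recency-based policy such as LRU does not guarantee.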
Pages: 66-85
Number of pages: 20