Effective data management strategy and RDD weight cache replacement strategy in Spark

Cited by: 3
Authors
Jiang, Kun [1 ,3 ]
Du, Shaofeng [2 ]
Zhao, Fu [2 ]
Huang, Yong [4 ]
Li, Chunlin [1 ,3 ]
Luo, Youlong [3 ]
Affiliations
[1] China Inst Water Resources & Hydropower Res, Key Lab Construction & Safety Water Engn, Minist Water Resources, Beijing, Peoples R China
[2] State Key Lab Smart Mfg Special Vehicles & Transmi, Baotou, Peoples R China
[3] Wuhan Univ Technol, Sch Comp Sci & Artificial Intelligence, Wuhan 430063, Peoples R China
[4] Chongqing Univ, Key Lab New Technol Construction Cities Mt Area, Minist Educ, Chongqing 400045, Peoples R China
Keywords
Data shuffling; Data management; Cache gain; RDD partition weights; Adaptive cache replacement; DATA PLACEMENT; ALGORITHM;
DOI
10.1016/j.comcom.2022.07.008
Chinese Library Classification
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
With the dramatic increase in internet users and their demand for real-time network performance, the Spark distributed computing environment has emerged. It is widely used due to its high-performance caching mechanism and high scalability. Because data access patterns in current big data environments are unpredictable, the data shuffling phase is prone to under-utilization of Spark cluster resources, high computational latency, and high task processing latency. To address this, this paper proposes an intermediate data management strategy for the data shuffling phase. First, the size of the data generated in the data shuffling phase of the Spark platform is predicted by random sampling. Next, a strength division strategy measures the degree of data skew to identify the part with excessive skew deviation. Finally, an adaptive data management strategy performs the corresponding computation tasks according to the data deviation. In addition, to improve the response time, memory usage, and computation latency of Spark applications, an adaptive cache replacement algorithm based on RDD partition weights is proposed. It calculates each RDD partition's weight from four factors: computation cost, usage count, partition size, and life cycle. Compared with current mainstream baseline algorithms, the proposed data management algorithm for the data shuffling phase effectively reduces resource usage and computational response latency, and the proposed adaptive cache replacement algorithm based on RDD partition weights makes fuller use of memory resources and reduces resource waste.
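The weight-based cache replacement idea in the abstract can be illustrated with a minimal sketch. The abstract names the four factors but not how they are combined, so the `weight` formula below (cost × usage count × life cycle ÷ size) is an assumption made purely for illustration, not the formula from the paper. The `Partition` class and `admit` function are likewise hypothetical names and are not part of Spark's API.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    compute_cost: float  # time to recompute the partition if it is evicted
    use_count: int       # how many more times the partition is expected to be used
    size: float          # memory footprint, e.g. in MB
    life_cycle: float    # remaining share of the partition's lifetime, in (0, 1]


def weight(p: Partition) -> float:
    # ASSUMED combination of the four factors: expensive, frequently used,
    # long-lived, and small partitions are the most valuable to keep.
    return p.compute_cost * p.use_count * p.life_cycle / p.size


def admit(cache, incoming, capacity):
    """Try to cache `incoming`; return (new_cache, names_of_evicted_partitions).

    Only partitions with a lower weight than `incoming` may be evicted,
    lowest weight first. If that cannot free enough space, the cache is
    left unchanged and `incoming` is not cached.
    """
    used = sum(p.size for p in cache)
    need = used + incoming.size - capacity
    if need <= 0:
        return cache + [incoming], []

    candidates = sorted(
        (p for p in cache if weight(p) < weight(incoming)), key=weight
    )
    victims, freed = [], 0.0
    for p in candidates:
        if freed >= need:
            break
        victims.append(p.name)
        freed += p.size

    if freed < need:
        return list(cache), []  # not enough low-weight data to evict
    kept = [p for p in cache if p.name not in victims]
    return kept + [incoming], victims
```

In this sketch a newly computed partition displaces cached data only when it is worth more than what it replaces, which is the intuition behind weighting partitions instead of evicting purely by recency.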
Pages: 66 - 85
Page count: 20
Related papers
50 in total
  • [21] Weight Management Using a Meal Replacement Strategy in Type 2 Diabetes
    Hamdy, Osama
    Zwiefelhofer, Debbie
    CURRENT DIABETES REPORTS, 2010, 10 (02) : 159 - 164
  • [22] A Weight-Based Dynamic Replica Replacement Strategy in Data Grids
    Zhao, Wuqing
    Xu, Xianbin
    Xiong, Naixue
    Wang, Zhuowei
    2008 IEEE ASIA-PACIFIC SERVICES COMPUTING CONFERENCE, VOLS 1-3, PROCEEDINGS, 2008, : 1544 - +
  • [24] Distributed cache strategy based on LT codes under spark platform
    Shang, Jing
    Zhang, Yifei
    Wang, Jibin
    Wu, Zhihui
    Xiao, Zhiwen
    JOURNAL OF SUPERCOMPUTING, 2024, 80 (11): : 16519 - 16545
  • [25] Cache replacement strategy with encounter probability estimation in opportunistic network
    Key Laboratory of Optical Communication and Network, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
    Shanghai Jiaotong Daxue Xuebao, 11 (1680-1684):
  • [26] A Web Cache Replacement Strategy for Safety-Critical Systems
    Du, Jianhai
    Gao, Shiwei
    Lv, Jianghua
    Li, Qianqian
    Ma, Shilong
    TEHNICKI VJESNIK-TECHNICAL GAZETTE, 2018, 25 (03): : 820 - 830
  • [27] Web Cache Replacement Strategy Based-on Reference Degree
    Wu, Xiaozhou
    Xu, Hongzhe
    Li, Wen
    Zhu, Xiaoguang
    2015 IEEE INTERNATIONAL CONFERENCE ON SMART CITY/SOCIALCOM/SUSTAINCOM (SMARTCITY), 2015, : 209 - 212
  • [28] Towards a novel cache replacement strategy for Named Data Networking based on Software Defined Networking
    Kalghoum, Anwar
    Gammar, Sonia Mettali
    Saidane, Leila Azouz
    COMPUTERS & ELECTRICAL ENGINEERING, 2018, 66 : 98 - 113
  • [29] Cache Replacement Strategy With Limited Service Capacity in Heterogeneous Networks
    Jiang, Le
    Zhang, Xinglin
    IEEE ACCESS, 2020, 8 : 25509 - 25520
  • [30] LCS: An Efficient Data Eviction Strategy for Spark
    Geng, Yuanzhen
    Shi, Xuanhua
    Pei, Cheng
    Jin, Hai
    Jiang, Wenbin
    INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING, 2017, 45 (06) : 1285 - 1297