Distributed Machine Learning based Mitigating Straggler in Big Data Environment

Cited by: 1
Authors
Lu, Haodong [1]
Wang, Kun [2]
Affiliations
[1] Nanjing Univ Posts & Telecommun, Coll Internet Things, Nanjing, Peoples R China
[2] Univ Calif Los Angeles, Dept Elect & Comp Engn, Los Angeles, CA USA
Funding
National Natural Science Foundation of China
Keywords
Parameter Server; Straggler; Deep Reinforcement Learning;
DOI
10.1109/ICC42927.2021.9500531
Chinese Library Classification number
TN [Electronic technology, communication technology]
Discipline classification code
0809
Abstract
In the big data era, the parameter server paradigm is regarded as an efficient and practical way to improve performance when processing deep learning (DL) applications. One of the main problems is that stragglers greatly hinder DL training progress, and previous methods do not fully consider the cluster's resource utilization when dealing with them. To mitigate the straggler problem in the parameter server, we propose a Deep Reinforcement Learning (DRL)-based framework called Distributed Actor-critic Reinforcement Learning (DARL) that automatically adapts each worker's training load to the dynamic cluster without requiring parameter settings. DARL employs state-of-the-art techniques to stabilize training and improve convergence, including a distributed framework, multiple actors, and prioritized experience replay. We also apply a customized experience sampling method to fully exploit potentially good samples. Experiments using real DL workloads show that DARL outperforms the representative Bulk Synchronous Parallel (BSP) scheme by 57.8% and the Stale Synchronous Parallel (SSP) scheme by 503% in terms of per-iteration time in a heterogeneous environment.
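To make the straggler effect concrete, the sketch below shows why one slow worker dominates per-iteration time under BSP, and how shifting training load toward faster workers shortens the iteration. It is not DARL: the abstract does not describe the policy's internals, so a simple static throughput-proportional split stands in for the learned load adaptation. The function names, worker throughputs, and batch size are all hypothetical.

```python
def bsp_iteration_time(batch_sizes, throughputs):
    """Under BSP every worker must finish before the next iteration starts,
    so the iteration lasts as long as the slowest worker."""
    return max(b / t for b, t in zip(batch_sizes, throughputs))


def proportional_split(total_batch, throughputs):
    """Split total_batch across workers in proportion to throughput.

    Assumption: a static heuristic used only for illustration, not the
    DRL policy described in the abstract.
    """
    total = sum(throughputs)
    sizes = [int(total_batch * t / total) for t in throughputs]
    # Hand any rounding remainder to the fastest worker.
    sizes[throughputs.index(max(throughputs))] += total_batch - sum(sizes)
    return sizes


# Hypothetical cluster: throughput in samples per second; the last worker is a straggler.
throughputs = [100.0, 100.0, 100.0, 25.0]
total_batch = 260

equal = [total_batch // len(throughputs)] * len(throughputs)
balanced = proportional_split(total_batch, throughputs)

print("equal split:   ", equal, bsp_iteration_time(equal, throughputs), "s")
print("balanced split:", balanced, bsp_iteration_time(balanced, throughputs), "s")
```

With the equal split, the straggler alone sets the iteration time while the three fast workers sit idle. The proportional split lets all four workers finish together. A static split like this assumes throughputs are known and stable, which is exactly what a dynamic cluster does not guarantee.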
Pages: 6