Scalable Random Forest with Data-Parallel Computing

Authors
Vazquez-Novoa, Fernando [1]
Conejero, Javier [1]
Tatu, Cristian [1]
Badia, Rosa M. [1]
Affiliations
[1] Barcelona Supercomp Ctr BSC CNS, Barcelona, Spain
Keywords
Random Forest; PyCOMPSs; COMPSs; Parallelism; Distributed Computing; Dislib; Machine Learning; HPC; Workflows
DOI
10.1007/978-3-031-39698-4_27
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification codes
081202; 0835
Abstract
In recent years, both the volume of available data and the computational resources to process it have grown significantly. This has led the scientific and industrial communities to pursue more accurate and efficient Machine Learning (ML) models. Random Forest is a well-known ML algorithm because of the good results it obtains across a wide range of problems. Our objective is to create a parallel version of the algorithm that can build a model from data distributed across different processors and that scales computationally with the available resources. This paper presents two novel data-parallel proposals for this algorithm. The first is implemented with the PyCOMPSs framework and its failure management mechanism, while the second uses the new PyCOMPSs nesting paradigm, in which parallel tasks can generate other tasks within them. The two approaches are compared with each other and with the Apache Spark MLlib Random Forest in strong and weak scaling tests. Our findings indicate that, while the MLlib implementation is faster on a small number of nodes, both new variants scale better. We conclude that the proposed data-parallel approaches to the Random Forest algorithm can effectively generate accurate and efficient models in a distributed computing environment and offer improved scalability over existing methods.
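The abstract refers to strong and weak scaling tests without defining them. As a rough illustration, the usual efficiency measures can be sketched as below. The function names and the timings in the usage lines are hypothetical and are not taken from the paper; the formulas are the standard textbook definitions, not necessarily the exact metrics the authors report.

```python
def strong_scaling_efficiency(t_base, t_n, n_base, n):
    """Efficiency when the total problem size is held fixed.

    Ideal behaviour: run time shrinks in proportion to n_base / n,
    so perfect scaling gives 1.0.
    """
    if min(t_base, t_n) <= 0 or min(n_base, n) <= 0:
        raise ValueError("times and node counts must be positive")
    return (t_base * n_base) / (t_n * n)


def weak_scaling_efficiency(t_base, t_n):
    """Efficiency when the problem size grows in proportion to the nodes.

    Ideal behaviour: run time stays constant, so perfect scaling gives 1.0.
    """
    if min(t_base, t_n) <= 0:
        raise ValueError("times must be positive")
    return t_base / t_n


# Hypothetical timings in seconds (illustrative only, not measured results).
strong = strong_scaling_efficiency(t_base=100.0, t_n=30.0, n_base=1, n=4)
weak = weak_scaling_efficiency(t_base=100.0, t_n=125.0)
print(f"strong scaling efficiency: {strong:.3f}")
print(f"weak scaling efficiency:   {weak:.3f}")
```

Under these definitions, an implementation can be slower in absolute time on a few nodes and still show higher efficiency as nodes are added, which is the kind of trade-off the abstract describes.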
Pages: 397-410
Number of pages: 14