Scalable Random Forest with Data-Parallel Computing

Cited by: 1
Authors
Vazquez-Novoa, Fernando [1 ]
Conejero, Javier [1 ]
Tatu, Cristian [1 ]
Badia, Rosa M. [1 ]
Affiliation
[1] Barcelona Supercomputing Center (BSC-CNS), Barcelona, Spain
Keywords
Random Forest; PyCOMPSs; COMPSs; Parallelism; Distributed Computing; Dislib; Machine Learning; HPC; Workflows
DOI
10.1007/978-3-031-39698-4_27
CLC number (Chinese Library Classification)
TP31 [Computer Software];
Subject classification code
081202; 0835;
Abstract
In recent years, there has been a significant increase in the amount of available data and computational resources. This has led the scientific and industrial communities to pursue more accurate and efficient Machine Learning (ML) models. Random Forest is a well-known algorithm in the ML field due to the good results it obtains across a wide range of problems. Our objective is to create a parallel version of the algorithm that can generate a model using data distributed across different processors and that scales with the available computational resources. This paper presents two novel data-parallel proposals for this algorithm. The first version is implemented using the PyCOMPSs framework and its failure-management mechanism, while the second variant uses the new PyCOMPSs nesting paradigm, where parallel tasks can generate further tasks within them. Both approaches are compared against each other and against the MLlib Apache Spark Random Forest using strong and weak scaling tests. Our findings indicate that while the MLlib implementation is faster when executed on a small number of nodes, the scalability of both new variants is superior. We conclude that the proposed data-parallel approaches to the Random Forest algorithm can effectively generate accurate and efficient models in a distributed computing environment and offer improved scalability over existing methods.
Pages: 397-410
Number of pages: 14
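
The abstract describes a task-based, data-parallel Random Forest built on PyCOMPSs, where independent trees are trained concurrently on distributed data blocks. As a rough illustration of that idea (a minimal sketch, not the authors' implementation), the following Python code uses the real PyCOMPSs `@task` decorator and `compss_wait_on` call to train one scikit-learn decision tree per task on a bootstrap sample of a data block and combine the trees by majority vote; the helper names `fit_tree`, `fit_forest`, and `predict` are hypothetical.

```python
# Illustrative sketch of a data-parallel Random Forest with PyCOMPSs:
# each @task call becomes an asynchronous unit of work that the runtime
# can schedule on any available node.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from pycompss.api.task import task
from pycompss.api.api import compss_wait_on


@task(returns=1)
def fit_tree(x_block, y_block, seed):
    """Train one decision tree on a bootstrap sample of a single data block."""
    rng = np.random.RandomState(seed)
    idx = rng.randint(0, len(x_block), size=len(x_block))  # sample with replacement
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    tree.fit(x_block[idx], y_block[idx])
    return tree


def fit_forest(x_blocks, y_blocks, n_trees):
    """Spawn one asynchronous training task per tree, cycling over the blocks."""
    futures = [
        fit_tree(x_blocks[i % len(x_blocks)], y_blocks[i % len(y_blocks)], seed=i)
        for i in range(n_trees)
    ]
    # Synchronization point: collect the trained trees from the workers.
    return compss_wait_on(futures)


def predict(trees, x):
    """Majority vote over per-tree predictions (non-negative integer labels assumed)."""
    votes = np.stack([t.predict(x) for t in trees])  # shape: (n_trees, n_samples)
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, votes
    )
```

In this sketch every `fit_tree` invocation is an independent task, so training parallelism grows with the number of trees and data blocks; the nesting paradigm mentioned in the abstract would additionally let a task such as `fit_tree` spawn sub-tasks of its own.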