Scalable Random Forest with Data-Parallel Computing

Cited by: 1
Authors
Vazquez-Novoa, Fernando [1 ]
Conejero, Javier [1 ]
Tatu, Cristian [1 ]
Badia, Rosa M. [1 ]
Affiliations
[1] Barcelona Supercomputing Center (BSC-CNS), Barcelona, Spain
Keywords
Random Forest; PyCOMPSs; COMPSs; Parallelism; Distributed Computing; Dislib; Machine Learning; HPC; Workflows
DOI
10.1007/978-3-031-39698-4_27
Chinese Library Classification (CLC)
TP31 [Computer Software]
Discipline Codes
081202; 0835
Abstract
In recent years, there has been a significant increase in the quantity of available data and computational resources. This has led the scientific and industrial communities to pursue more accurate and efficient Machine Learning (ML) models. Random Forest is a well-known algorithm in the ML field due to the good results it obtains in a wide range of problems. Our objective is to create a parallel version of the algorithm that can generate a model using data distributed across different processors and that scales with the available computational resources. This paper presents two novel data-parallel proposals for this algorithm. The first version is implemented using the PyCOMPSs framework and its failure-management mechanism, while the second variant uses the new PyCOMPSs nesting paradigm, in which parallel tasks can generate other tasks within them. Both approaches are compared against each other and against the Apache Spark MLlib Random Forest with strong and weak scaling tests. Our findings indicate that, while the MLlib implementation is faster when executed on a small number of nodes, the scalability of both new variants is superior. We conclude that the proposed data-parallel approaches to the Random Forest algorithm can effectively generate accurate and efficient models in a distributed computing environment and offer improved scalability over existing methods.
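The task-based approach described in the abstract can be pictured with a minimal sketch, not taken from the paper: each decision tree is trained as an independent PyCOMPSs task on a bootstrap sample of a data block, assuming the standard pycompss.api.task decorator, compss_wait_on, and scikit-learn decision trees; the helper names fit_tree and fit_forest are illustrative assumptions.

# Minimal sketch (not the authors' code): one PyCOMPSs task per tree,
# trained on a bootstrap sample drawn from a block of the distributed data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from pycompss.api.task import task
from pycompss.api.api import compss_wait_on


@task(returns=1)
def fit_tree(x_block, y_block, seed):
    """Train a single decision tree on a bootstrap sample of its data block."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(x_block), size=len(x_block))  # bootstrap indices
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    tree.fit(x_block[idx], y_block[idx])
    return tree


def fit_forest(x_blocks, y_blocks, n_estimators=100):
    """Launch one task per tree; the runtime schedules them across the nodes."""
    futures = [
        fit_tree(x_blocks[i % len(x_blocks)], y_blocks[i % len(y_blocks)], seed=i)
        for i in range(n_estimators)
    ]
    return compss_wait_on(futures)  # synchronize and gather the trained trees

Under the nesting paradigm mentioned in the abstract, a driver function such as fit_forest would itself be declared as a task, so the per-tree tasks are spawned from within another task rather than from the main program.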
Pages: 397-410
Page count: 14