Scalable Random Forest with Data-Parallel Computing

Cited by: 1
Authors
Vazquez-Novoa, Fernando [1 ]
Conejero, Javier [1 ]
Tatu, Cristian [1 ]
Badia, Rosa M. [1 ]
Affiliations
[1] Barcelona Supercomp Ctr BSC CNS, Barcelona, Spain
Source
Keywords
Random Forest; PyCOMPSs; COMPSs; Parallelism; Distributed Computing; Dislib; Machine Learning; HPC; Workflows
DOI
10.1007/978-3-031-39698-4_27
Chinese Library Classification (CLC)
TP31 [Computer Software]
Discipline Classification Codes
081202; 0835
Abstract
In recent years, both the amount of available data and the computational resources to process it have grown significantly, leading the scientific and industrial communities to pursue more accurate and efficient Machine Learning (ML) models. Random Forest is a well-known ML algorithm, owing to the good results it obtains across a wide range of problems. Our objective is a parallel version of the algorithm that can build a model from data distributed across different processors and that scales with the available computational resources. This paper presents two novel data-parallel proposals for this algorithm. The first version is implemented with the PyCOMPSs framework and its failure-management mechanism, while the second variant uses the new PyCOMPSs nesting paradigm, in which parallel tasks can spawn further tasks within them. Both approaches are compared against each other and against the Apache Spark MLlib Random Forest with strong- and weak-scaling tests. Our findings indicate that, while the MLlib implementation is faster on a small number of nodes, both new variants scale better. We conclude that the proposed data-parallel approaches to the Random Forest algorithm can effectively generate accurate and efficient models in a distributed computing environment and offer improved scalability over existing methods.
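The core idea described in the abstract, training the forest's trees as independent parallel tasks over (bootstrap samples of) the training data and combining them by majority vote, can be sketched with Python's standard library alone. This is a minimal, self-contained illustration using `concurrent.futures` threads as a stand-in for PyCOMPSs task scheduling; it trains decision stumps rather than full trees, and all names (`train_stump`, `rf_fit`, `rf_predict`) are illustrative, not the authors' actual API.

```python
# Hedged sketch of task-parallel Random Forest training (NOT the paper's
# implementation): each tree is trained as an independent task on a
# bootstrap resample, then predictions are combined by majority vote.
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def train_stump(data, seed):
    """Train one decision stump on a bootstrap sample of (x, label) pairs."""
    rng = random.Random(seed)
    sample = [rng.choice(data) for _ in data]  # bootstrap resample
    best = None
    # exhaustively try each observed value as a split threshold
    for threshold in sorted({x for x, _ in sample}):
        left = [y for x, y in sample if x < threshold]
        right = [y for x, y in sample if x >= threshold]
        if not left or not right:
            continue
        left_lbl = Counter(left).most_common(1)[0][0]
        right_lbl = Counter(right).most_common(1)[0][0]
        acc = (sum(y == left_lbl for y in left) +
               sum(y == right_lbl for y in right)) / len(sample)
        if best is None or acc > best[0]:
            best = (acc, threshold, left_lbl, right_lbl)
    return best[1:]  # (threshold, left_label, right_label)

def rf_fit(data, n_trees=8, workers=4):
    """Submit one training task per tree; futures mimic parallel tasks."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(train_stump, data, s) for s in range(n_trees)]
        return [f.result() for f in futures]

def rf_predict(forest, x):
    """Majority vote over all stumps."""
    votes = [(l if x < t else r) for t, l, r in forest]
    return Counter(votes).most_common(1)[0][0]

# toy 1-D dataset: class 0 below 5, class 1 from 5 upward
data = [(i, 0) for i in range(5)] + [(i, 1) for i in range(5, 10)]
forest = rf_fit(data)
print(rf_predict(forest, 2), rf_predict(forest, 8))
```

In the paper's setting the `pool.submit` calls would instead be PyCOMPSs tasks dispatched to distributed workers (with the nested variant allowing a task to spawn sub-tasks for its partition of the data), but the fit/predict structure is the same.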
Pages: 397-410
Page count: 14