Ensemble Random Forests as a tool for modeling rare occurrences

Cited by: 16
Authors
Siders, Zachary A. [1 ]
Ducharme-Barth, Nicholas D. [2 ]
Carvalho, Felipe [3 ]
Kobayashi, Donald [3 ]
Martin, Summer [3 ]
Raynor, Jennifer [4 ]
Jones, T. Todd [3 ]
Ahrens, Robert N. M. [3 ]
Affiliations
[1] Univ Florida, UF IFAS SFRC Fisheries & Aquat Sci Program, Gainesville, FL 32611 USA
[2] Pacific Community, Ocean Fisheries Programme, Noumea 98800, New Caledonia
[3] NOAA Fisheries, Pacific Isl Fisheries Sci Ctr, Honolulu, HI 96818 USA
[4] Wesleyan Univ, Dept Econ, Middletown, CT 06457 USA
Keywords
Rare event bias; Species distribution modeling; Protected species; Bycatch; Machine learning; Random Forest; Species distribution models; Classifier; Space
DOI
10.3354/esr01060
Chinese Library Classification number
X176 [Biodiversity conservation]
Discipline classification code
090705
Abstract
Relative to target species, priority conservation species occur rarely in fishery interactions, resulting in imbalanced, overdispersed data. We present Ensemble Random Forests (ERFs) as an intuitive extension of the Random Forest algorithm to handle rare event bias. Each Random Forest receives individual stratified randomly sampled training/test sets, then down-samples the majority class for each decision tree. Results are averaged across Random Forests to generate an ensemble prediction. Through simulation, we show that ERFs outperform Random Forest with and without down-sampling, as well as with the synthetic minority over-sampling technique, for highly class imbalanced to balanced datasets. Spatial covariance greatly impacts ERFs' perceived performance, as shown through simulation and case studies. In case studies from the Hawaii deep-set longline fishery, giant manta ray Mobula birostris syn. Manta birostris and scalloped hammerhead Sphyrna lewini presence had high spatial covariance and high model test performance, while false killer whale Pseudorca crassidens had low spatial covariance and low model test performance. Overall, we find ERFs have 4 advantages: (1) reduced successive partitioning effects; (2) prediction uncertainty propagation; (3) better accounting for interacting covariates through balancing; and (4) minimization of false positives, as the majority of Random Forests within the ensemble vote correctly. As ERFs can readily mitigate rare event bias without requiring large presence sample sizes or imparting considerable balancing bias, they are likely to be a valuable tool in bycatch and species distribution modeling, as well as spatial conservation planning, especially for protected species where presence can be rare.
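The procedure summarized in the abstract (a separate stratified training/test split per forest, majority-class down-sampling per tree, and averaging across forests) can be illustrated with a minimal sketch. This is not the authors' implementation. It uses only the Python standard library, and every function name, the toy data, the tiny one-random-feature-per-node tree learner, and the choice to bootstrap each class to the minority-class size are assumptions made for illustration.

```python
import random

def stratified_split(rows, test_frac, rng):
    """Split (features, label) rows into train/test, keeping class ratios."""
    train, test = [], []
    for label in (0, 1):
        group = [r for r in rows if r[1] == label]
        rng.shuffle(group)
        n_test = max(1, int(len(group) * test_frac))
        test += group[:n_test]
        train += group[n_test:]
    return train, test

def balanced_bootstrap(rows, rng):
    """Assumed scheme: bootstrap each class to the minority-class size."""
    pos = [r for r in rows if r[1] == 1]
    neg = [r for r in rows if r[1] == 0]
    n = min(len(pos), len(neg))
    return rng.choices(pos, k=n) + rng.choices(neg, k=n)

def gini(rows):
    p = sum(y for _, y in rows) / len(rows)
    return 2 * p * (1 - p)

def grow_tree(rows, depth, rng):
    """Small tree; each node tries one random feature (a simplification)."""
    p = sum(y for _, y in rows) / len(rows)
    if depth == 0 or p in (0.0, 1.0):
        return p  # leaf: fraction of presences
    f = rng.randrange(len(rows[0][0]))
    best = None
    for t in sorted({x[f] for x, _ in rows})[1:]:  # thresholds keep both sides non-empty
        left = [r for r in rows if r[0][f] < t]
        right = [r for r in rows if r[0][f] >= t]
        score = len(left) * gini(left) + len(right) * gini(right)
        if best is None or score < best[0]:
            best = (score, t, left, right)
    if best is None:
        return p
    _, t, left, right = best
    return (f, t, grow_tree(left, depth - 1, rng), grow_tree(right, depth - 1, rng))

def tree_predict(node, x):
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if x[f] < t else right
    return node

def fit_forest(train, n_trees, depth, rng):
    # Each tree sees its own down-sampled (balanced) bootstrap sample.
    return [grow_tree(balanced_bootstrap(train, rng), depth, rng) for _ in range(n_trees)]

def forest_predict(forest, x):
    return sum(tree_predict(t, x) for t in forest) / len(forest)

def fit_erf(rows, n_forests=10, n_trees=25, depth=3, test_frac=0.2, seed=0):
    """Each forest gets its own stratified train/test split."""
    rng = random.Random(seed)
    forests = []
    for _ in range(n_forests):
        train, _test = stratified_split(rows, test_frac, rng)  # _test unused in this sketch
        forests.append(fit_forest(train, n_trees, depth, rng))
    return forests

def erf_predict(forests, x):
    """Ensemble prediction: mean presence probability across forests."""
    return sum(forest_predict(f, x) for f in forests) / len(forests)

def make_toy_data(n=400, seed=1):
    """Hypothetical data: presence is rare and tied to high values of feature 0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = [rng.random(), rng.random()]
        y = 1 if x[0] > 0.9 and rng.random() < 0.8 else 0
        rows.append((x, y))
    return rows

rows = make_toy_data()
erf = fit_erf(rows)
p_high = erf_predict(erf, [0.95, 0.5])  # inside the simulated presence region
p_low = erf_predict(erf, [0.10, 0.5])   # far from it
```

In this sketch the held-out test set of each forest is discarded. In practice it would presumably be used to score each forest before averaging, and a full Random Forest implementation would replace the simplified tree learner.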
Pages: 183-197
Number of pages: 15
Related papers
50 in total
  • [31] Modeling of time series using random forests: Theoretical developments
    Davis, Richard A.
    Nielsen, Mikkel S.
    ELECTRONIC JOURNAL OF STATISTICS, 2020, 14 (02): : 3644 - 3671
  • [32] Modeling binding specificities of transcription factor pairs with random forests
    Anni A. Antikainen
    Markus Heinonen
    Harri Lähdesmäki
    BMC Bioinformatics, 23
  • [33] Modeling binding specificities of transcription factor pairs with random forests
    Antikainen, Anni A.
    Heinonen, Markus
    Lahdesmaki, Harri
    BMC BIOINFORMATICS, 2022, 23 (01)
  • [34] Modeling portfolio risk by risk discriminatory trees and random forests
    Yang, Bill Huajian
    JOURNAL OF RISK MODEL VALIDATION, 2014, 8 (01): : 91 - 110
  • [35] Random Forests for Uplift Modeling: An Insurance Customer Retention Case
    Guelman, Leo
    Guillen, Montserrat
    Perez-Marin, Ana M.
    MODELING AND SIMULATION IN ENGINEERING, ECONOMICS, AND MANAGEMENT, MS 2012, 2012, 115 : 123 - 133
  • [36] Modeling Gene Regulation in Liver Hepatocellular Carcinoma with Random Forests
    Kazan, Hilal
    BIOMED RESEARCH INTERNATIONAL, 2016, 2016
  • [37] Trading-Off Diversity and Accuracy for Optimal Ensemble Tree Selection in Random Forests
    Elghazel, Haytham
    Aussem, Alex
    Perraud, Florence
    ENSEMBLES IN MACHINE LEARNING APPLICATIONS, 2011, 373 : 169 - 179
  • [38] A hybrid random forests and artificial neural networks bagging ensemble for landslide susceptibility modelling
    Lucchese, Luisa V.
    de Oliveira, Guilherme G.
    Pedrollo, Olavo C.
    GEOCARTO INTERNATIONAL, 2022, 37 (27) : 16492 - 16511
  • [39] Provable Boolean interaction recovery from tree ensemble obtained via random forests
    Behr, Merle
    Wang, Yu
    Li, Xiao
    Yu, Bin
    PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2022, 119 (22)
  • [40] Cloud Detection for FY Meteorology Satellite Based on Ensemble Thresholds and Random Forests Approach
    Fu, Hualian
    Shen, Yuan
    Liu, Jun
    He, Guangjun
    Chen, Jinsong
    Liu, Ping
    Qian, Jing
    Li, Jun
    REMOTE SENSING, 2019, 11 (01)