The effectiveness of data pre-processing methods on the performance of machine learning techniques using RF, SVR, Cubist and SGB: a study on undrained shear strength prediction

Cited by: 1
Authors
Demir, Selcuk [1]
Sahin, Emrehan Kutlug [1]
Affiliations
[1] Bolu Abant Izzet Baysal Univ, Dept Civil Engn, TR-14030 Bolu, Turkiye
Keywords
Box-cox; Range; Undrained shear strength; Machine learning; Cubist; TREES; MODEL;
DOI
10.1007/s00477-024-02745-9
Chinese Library Classification (CLC) number
X [Environmental Science, Safety Science]
Discipline classification code
08; 0830
Abstract
In machine learning (ML), scaling, normalization, and standardization are crucial data engineering steps: they transform the data so that it is better suited to the modeling technique and to subsequent analysis. Although many conventional and relatively new ML approaches have been applied, research on this pre-processing step remains conspicuously scarce, particularly in the geotechnical discipline. In this study, ML-based prediction models (i.e., RF, SVR, Cubist, and SGB) were developed to estimate the undrained shear strength (UDSS) of cohesive soil under a wide range of data-scaling and transformation methods. The work thus presents a novel ML framework based on data engineering approaches and the Cubist regression method to predict the UDSS of cohesive soil. A dataset with six features and one target variable was used to build the prediction models, and model performance was examined with respect to the impact of data pre-processing. For that purpose, five data scaling and transformation methods, namely Range, Z-Score, Log Transformation, Box-Cox, and Yeo-Johnson, were used to generate the models. The results were then systematically compared across different sampling ratios to understand how model performance varies with the combination of scaling/transformation method and ML algorithm. The effect of data transformation and data sampling on UDSS model performance ranged from limited to considerable, depending on the algorithm type and the sampling ratio. Compared to the RF, SVR, and SGB models, the Cubist models achieved higher performance metrics after the data pre-processing steps were applied. The Box-Cox transformed Cubist model yielded the best prediction performance of all models, with an R² of 0.87 for the 90% training set. In addition, the UDSS prediction models generally performed better with the transformation-based methods (i.e., Box-Cox, Log, and Yeo-Johnson) than with the scaling-based methods (i.e., Range and Z-Score). The results show that the Cubist model has a higher potential for UDSS prediction and that data pre-processing methods affect the predictive capacity of the evaluated regression models.
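The five pre-processing methods named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the sample values, and the fixed `lam` parameter are assumptions made for illustration. In practice the Box-Cox and Yeo-Johnson parameter is usually estimated from the training data (for example by maximum likelihood), and the Z-Score here uses the population standard deviation, which may differ from the convention used in the study.

```python
import math
from statistics import mean, pstdev


def range_scale(xs):
    """Range (min-max) scaling: map values linearly onto [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def z_score(xs):
    """Z-Score: subtract the mean, divide by the (population) std. dev."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def log_transform(xs):
    """Natural-log transformation; defined only for strictly positive values."""
    return [math.log(x) for x in xs]


def box_cox(xs, lam):
    """Box-Cox with a fixed, user-supplied lambda (assumption: not estimated).

    Defined only for strictly positive values.
    """
    if lam == 0:
        return [math.log(x) for x in xs]
    return [(x ** lam - 1) / lam for x in xs]


def yeo_johnson(xs, lam):
    """Yeo-Johnson with a fixed lambda; also accepts zero and negative values."""
    out = []
    for x in xs:
        if x >= 0:
            y = math.log1p(x) if lam == 0 else ((x + 1) ** lam - 1) / lam
        else:
            y = (-math.log1p(-x) if lam == 2
                 else -((1 - x) ** (2 - lam) - 1) / (2 - lam))
        out.append(y)
    return out


if __name__ == "__main__":
    # Hypothetical feature values, not taken from the study's dataset.
    values = [12.0, 25.0, 40.0, 80.0]
    print(range_scale(values))
    print(z_score(values))
    print(log_transform(values))
    print(box_cox(values, 0.5))
    print(yeo_johnson(values, 0.5))
```

Range and Z-Score only shift and rescale a feature, leaving the shape of its distribution unchanged, whereas the Log, Box-Cox, and Yeo-Johnson transformations are nonlinear and can reduce skewness. This is the distinction the abstract draws between scaling-based and transformation-based methods.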
Pages: 3273-3290
Number of pages: 18