A Data-Driven Methodology for Guiding the Selection of Preprocessing Techniques in a Machine Learning Pipeline

Cited by: 0
Authors
Garcia-Carrasco, Jorge [1]
Mate, Alejandro [1]
Trujillo, Juan [1]
Affiliations
[1] Univ Alicante, Lucentia Res Grp, Dept Software & Comp Syst, Ctra San Vicente del Raspeig S-N, San Vicente Del Raspeig 03690, Spain
Keywords
Data-driven; Preprocessing; Methodology; Data Science; IMPACT
DOI
10.1007/978-3-031-34674-3_5
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The performance of a Machine Learning (ML) model greatly depends on how the data were preprocessed beforehand. Unfortunately, the decision on which preprocessing techniques to apply relies on the expertise of data scientists and/or ML practitioners. Since the correct application of some techniques depends on the characteristics of the data whereas that of others depends on the particular ML model to be trained, the process is error-prone and requires the data scientist to be knowledgeable about all the combinations that may arise. To tackle this problem, we propose a methodology that guides the selection of the most appropriate preprocessing techniques, those that are highly required or strongly recommended, taking into account both the ML model and the data characteristics, so that the developer can freely experiment with different models while ensuring that no needed preprocessing technique is overlooked. According to the ML model and the data at hand, the methodology will (i) obtain the characteristics of the model, (ii) check whether these characteristics are met by the data, and (iii) show the developer which variables require preprocessing and which techniques should be applied, so that a proper decision can be made. To the best of our knowledge, this is the only work that tries to gather the most common ML models together with their most adequate preprocessing techniques and encode this information into a methodology that guides this process in a systematic way.
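The three steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the rule base `MODEL_REQUIREMENTS`, the requirement checks, the model names, and the technique labels are all hypothetical placeholders chosen for illustration, and the scale check is a deliberately crude heuristic.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Column = List[object]


def has_missing(values: Column) -> bool:
    """True if the column contains at least one missing value (None)."""
    return any(v is None for v in values)


def is_non_numeric(values: Column) -> bool:
    """True if the column contains at least one string value."""
    return any(isinstance(v, str) for v in values)


def is_unscaled(values: Column) -> bool:
    """Placeholder heuristic: a numeric value lies outside [-1, 1]."""
    nums = [v for v in values
            if isinstance(v, (int, float)) and not isinstance(v, bool)]
    return any(abs(v) > 1 for v in nums)


@dataclass(frozen=True)
class Requirement:
    name: str                               # what the model expects of the data
    violated_by: Callable[[Column], bool]   # check run on each column
    technique: str                          # preprocessing to suggest if violated


# Hypothetical rule base (an assumption, not taken from the paper).
NO_MISSING = Requirement("no missing values", has_missing, "imputation")
NUMERIC = Requirement("numeric inputs", is_non_numeric, "categorical encoding")
SCALED = Requirement("comparable feature scales", is_unscaled, "scaling")

MODEL_REQUIREMENTS: Dict[str, List[Requirement]] = {
    "knn": [NO_MISSING, NUMERIC, SCALED],
    "decision_tree": [NO_MISSING, NUMERIC],
}


def recommend(model: str, data: Dict[str, Column]) -> Dict[str, List[str]]:
    """Map each column that violates a model requirement to suggested techniques."""
    requirements = MODEL_REQUIREMENTS[model]            # step (i)
    report: Dict[str, List[str]] = {}
    for column, values in data.items():
        techniques = [r.technique for r in requirements
                      if r.violated_by(values)]         # step (ii)
        if techniques:
            report[column] = techniques                 # step (iii)
    return report


# Toy data, invented for this example.
data = {
    "age": [23, 41, None, 35],
    "city": ["Alicante", "Madrid", "Alicante", "Valencia"],
    "ratio": [0.1, 0.5, 0.9, 0.3],
}
print(recommend("knn", data))
print(recommend("decision_tree", data))
```

Keeping the requirements in a per-model table means that switching models only changes which checks run, which is the property the abstract emphasises: the developer can experiment with different models and still see every column that needs attention.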
Pages: 34 - 42
Page count: 9
Related papers
50 items in total
  • [21] Data-driven accident consequence assessment on urban gas pipeline network based on machine learning
    Yang, Yang
    Li, Suzhen
    Zhang, Pengcheng
    RELIABILITY ENGINEERING & SYSTEM SAFETY, 2022, 219
  • [22] Data-driven decarbonization framework with machine learning
    Jain, Ayush
    Padmanaban, Manikandan
    Hazra, Jagabondhu
    Guruprasad, Ranjini
    Godbole, Shantanu
    Syam, Heriansyah
    ENVIRONMENTAL DATA SCIENCE, 2024, 3
  • [23] Significance and methodology: Preprocessing the big data for machine learning on TBM performance
    Xiao, Hao-Han
    Yang, Wen-Kun
    Hu, Jing
    Zhang, Yun-Pei
    Jing, Liu-Jie
    Chen, Zu-Yu
    UNDERGROUND SPACE, 2022, 7 (04) : 680 - 701
  • [24] Machine Learning based Video Coding using Data-driven Techniques and Advanced Models
    Kwong, Sam
    PROCEEDINGS OF THE 2019 IEEE 18TH INTERNATIONAL CONFERENCE ON COGNITIVE INFORMATICS & COGNITIVE COMPUTING (ICCI*CC 2019), 2019 : 4 - 4
  • [25] Data-Driven Stroke Classification Utilizing Electromyographic Muscle Features and Machine Learning Techniques
    Lee, Jaehyuk
    Kim, Youngjun
    Kim, Eunchan
    APPLIED SCIENCES-BASEL, 2024, 14 (18)
  • [26] Predictive capabilities of data-driven machine learning techniques on wave-bridge interactions
    Zhu, Deming
    Zhang, Jiaxin
    Wu, Qian
    Dong, You
    Bastidas-Arteaga, Emilio
    APPLIED OCEAN RESEARCH, 2023, 137
  • [27] Data-Driven Modeling of Electric Vehicle Charging Sessions Based on Machine Learning Techniques
    Kene, Raymond O.
    Olwal, Thomas O.
    WORLD ELECTRIC VEHICLE JOURNAL, 2025, 16 (02)
  • [28] Machine-Learning Techniques Assist Data-Driven Well-Performance Optimization
    Carpenter, Chris
    JPT, Journal of Petroleum Technology, 2021, 73 (10) : 63 - 64
  • [29] Data-driven prediction of soccer outcomes using enhanced machine and deep learning techniques
    Mills, Ebenezer Fiifi Emire Atta
    Deng, Zihui
    Zhong, Zhuoqing
    Li, Jinger
    JOURNAL OF BIG DATA, 2024, 11 (01)
  • [30] Data-driven diagnosis of spinal abnormalities using feature selection and machine learning algorithms
    Raihan-Al-Masud, Md
    Mondal, M. Rubaiyat Hossain
    PLOS ONE, 2020, 15 (02)