A Data-Driven Methodology for Guiding the Selection of Preprocessing Techniques in a Machine Learning Pipeline

Cited by: 0
Authors
Garcia-Carrasco, Jorge [1]
Mate, Alejandro [1]
Trujillo, Juan [1]
Affiliations
[1] Univ Alicante, Lucentia Res Grp, Dept Software & Comp Syst, Ctra San Vicente del Raspeig S-N, San Vicente Del Raspeig 03690, Spain
Keywords
Data-driven; Preprocessing; Methodology; Data Science; IMPACT;
DOI
10.1007/978-3-031-34674-3_5
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The performance of a Machine Learning (ML) model depends greatly on how the data have been preprocessed. Unfortunately, the decision on which preprocessing techniques should be applied relies on the expertise of data scientists and/or ML practitioners. Since the correct application of some techniques depends on the characteristics of the data, whereas that of others depends on the particular ML model to be trained, this leads to an error-prone process that requires the data scientist to be knowledgeable about all the combinations that may arise. To tackle this problem, we propose a methodology that guides the selection of the most appropriate preprocessing techniques, namely those that are required or strongly recommended, taking into account both the ML model and the data characteristics, so that the developer can freely experiment with different models while ensuring that no needed preprocessing technique is overlooked. Given the ML model and the data at hand, the methodology will (i) obtain the characteristics of the model, (ii) check whether these characteristics are met by the data, and (iii) show the developer which variables require preprocessing and which techniques should be applied, so that a proper decision can be made. To the best of our knowledge, this is the only work that tries to gather the most common ML models together with their most adequate preprocessing techniques and encode this information into a methodology that guides this process in a systematic way.
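The three steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the requirement table `MODEL_REQUIREMENTS`, the mapping `SUGGESTED_TECHNIQUE`, the helper names, the 10x spread threshold, and the sample data are all hypothetical choices made here for illustration only.

```python
from statistics import pstdev

# Hypothetical requirement table (an assumption, not the paper's catalogue):
# the data properties each model family is assumed to need.
MODEL_REQUIREMENTS = {
    "knn": ["no_missing_values", "numeric_only", "similar_scale"],
    "decision_tree": ["no_missing_values", "numeric_only"],
}

# Hypothetical mapping from an unmet requirement to a suggested technique.
SUGGESTED_TECHNIQUE = {
    "no_missing_values": "imputation",
    "numeric_only": "categorical encoding",
    "similar_scale": "standardization or min-max scaling",
}


def _present(values):
    """Drop missing entries (represented here as None)."""
    return [v for v in values if v is not None]


def _is_numeric(values):
    """True if every non-missing value is an int or a float."""
    return all(isinstance(v, (int, float)) for v in _present(values))


def violations(data, requirement):
    """Return the columns of `data` that do not satisfy `requirement`."""
    if requirement == "no_missing_values":
        return [c for c, vals in data.items() if any(v is None for v in vals)]
    if requirement == "numeric_only":
        return [c for c, vals in data.items() if not _is_numeric(vals)]
    if requirement == "similar_scale":
        # Compare the spread of the numeric columns; constant columns are skipped.
        spreads = {}
        for column, vals in data.items():
            present = _present(vals)
            if _is_numeric(vals) and len(present) > 1:
                spread = pstdev(present)
                if spread > 0:
                    spreads[column] = spread
        if not spreads:
            return []
        smallest = min(spreads.values())
        # Arbitrary illustrative threshold: flag columns whose spread is
        # more than 10 times the smallest one.
        return [c for c, s in spreads.items() if s > 10 * smallest]
    raise ValueError(f"unknown requirement: {requirement}")


def recommend(model, data):
    """Map each suggested technique to the variables that need it."""
    report = {}
    for requirement in MODEL_REQUIREMENTS[model]:      # step (i)
        columns = violations(data, requirement)        # step (ii)
        if columns:                                    # step (iii)
            report[SUGGESTED_TECHNIQUE[requirement]] = columns
    return report


# Made-up sample data: one list of values per variable, None marks a missing value.
data = {
    "age": [23, 35, None, 51],
    "income": [21000.0, 54000.0, 38000.0, 90000.0],
    "city": ["Alicante", "Madrid", "Alicante", "Valencia"],
}

for model in MODEL_REQUIREMENTS:
    print(model, recommend(model, data))
```

Keeping the model requirements in a lookup table, separate from the data checks, means that trying a different model only changes which checks run. The report stays a suggestion for the developer rather than an automatic transformation, in line with the abstract's aim of supporting a decision.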
Pages: 34-42
Page count: 9
Related papers
50 in total
  • [31] A comprehensive pipeline to integrate preprocessing and machine learning techniques for accurate classification in Raman spectroscopy
    Innocente, Simone
    Maryam, Siddra
    Andersson-Engels, Stefan
    Komolibus, Katarzyna
    Gautam, Rekha
    Visentin, Andrea
    DATA SCIENCE FOR PHOTONICS AND BIOPHOTONICS, 2024, 13011
  • [32] Data-driven resilient model development and feature selection for rock compressive strength prediction using machine learning and transformer techniques
    Rahaman, Md. Shakil
    Miah, Mohammad Islam
    EARTH SCIENCE INFORMATICS, 2025, 18 (03)
  • [33] Deep Transfer Learning for Industrial Automation: A Review and Discussion of New Techniques for Data-Driven Machine Learning
    Maschler, Benjamin
    Weyrich, Michael
    IEEE INDUSTRIAL ELECTRONICS MAGAZINE, 2021, 15 (02) : 65 - 75
  • [34] Predicting monthly streamflow using data-driven models coupled with data-preprocessing techniques
    Wu, C. L.
    Chau, K. W.
    Li, Y. S.
    WATER RESOURCES RESEARCH, 2009, 45
  • [35] Determining water and solute permeability of reverse osmosis membrane using a data-driven machine learning pipeline
    Chae, Sung Ho
    Hong, Seok Won
    Son, Moon
    Cho, Kyung Hwa
    JOURNAL OF WATER PROCESS ENGINEERING, 2024, 64
  • [37] Imbalanced data preprocessing techniques for machine learning: a systematic mapping study
    de Vargas, Vitor Werner
    Schneider Aranda, Jorge Arthur
    Costa, Ricardo dos Santos
    da Silva Pereira, Paulo Ricardo
    Victoria Barbosa, Jorge Luis
    KNOWLEDGE AND INFORMATION SYSTEMS, 2023, 65 (01) : 31 - 57
  • [38] Effect of Data Preprocessing in the Detection of Epilepsy using Machine Learning Techniques
    Sabarivani, A.
    Ramadevi, R.
    Pandian, R.
    Krishnamoorthy, N. R.
    JOURNAL OF SCIENTIFIC & INDUSTRIAL RESEARCH, 2021, 80 (12) : 1066 - 1077
  • [39] Data-Driven Learning: A Scaffolding Methodology for CLIL and LSP Teaching and Learning
    Corino, Elisa
    Onesti, Cristina
    FRONTIERS IN EDUCATION, 2019, 4
  • [40] Data-driven models in machine learning for crime prediction
    Wawrzyniak, Zbigniew M.
    Jankowski, Stanislaw
    Szczechla, Eliza
    Szymanski, Zbigniew
    Pytlak, Radoslaw
    Michalak, Pawel
    Borowik, Grzegorz
    2018 26TH INTERNATIONAL CONFERENCE ON SYSTEMS ENGINEERING (ICSENG 2018), 2018