A Data-Driven Methodology for Guiding the Selection of Preprocessing Techniques in a Machine Learning Pipeline

Authors
Garcia-Carrasco, Jorge [1]
Mate, Alejandro [1]
Trujillo, Juan [1]
Affiliations
[1] Univ Alicante, Lucentia Res Grp, Dept Software & Comp Syst, Ctra San Vicente del Raspeig S-N, San Vicente Del Raspeig 03690, Spain
Keywords
Data-driven; Preprocessing; Methodology; Data Science; IMPACT;
DOI
10.1007/978-3-031-34674-3_5
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The performance of a Machine Learning (ML) model depends greatly on how the data has been preprocessed. Unfortunately, the decision on which preprocessing techniques should be applied relies on the expertise of data scientists and/or ML practitioners. Since the correct application of some techniques depends on the characteristics of the data whereas that of others depends on the particular ML model to be trained, this leads to an error-prone process that requires the data scientist to be knowledgeable about all the combinations that may arise. To tackle this problem, we propose a methodology that guides the selection of the most appropriate preprocessing techniques, namely those that are required or strongly recommended, taking into account both the ML model and the data characteristics, so that the developer can freely experiment with different models while ensuring that no needed preprocessing technique is overlooked. Given the ML model and the data at hand, the methodology will (i) obtain the characteristics required by the model, (ii) check whether these characteristics are met by the data, and (iii) show the developer which variables require preprocessing and which techniques should be applied, so that an informed decision can be made. To the best of our knowledge, this is the only work that gathers the most common ML models together with their most adequate preprocessing techniques and encodes this information into a methodology that guides this process in a systematic way.
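The three steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the requirement table, the requirement names, the suggested techniques, and the tolerance used for the scaling check are all hypothetical and chosen only for illustration. The sketch looks up what a model needs, checks each variable against those needs, and reports which variables require which preprocessing.

```python
from statistics import mean, pstdev

# Hypothetical catalogue: which data characteristics each model expects.
# The paper's actual catalogue is not reproduced here.
MODEL_REQUIREMENTS = {
    "knn": ["no_missing", "numeric", "scaled"],
    "decision_tree": ["no_missing"],
}

# Hypothetical mapping from an unmet requirement to a candidate technique.
SUGGESTED_TECHNIQUE = {
    "no_missing": "imputation",
    "numeric": "categorical encoding",
    "scaled": "standardization",
}


def _present(values):
    """Return the non-missing values of a column (None marks a missing value)."""
    return [v for v in values if v is not None]


def _is_numeric(values):
    """True if every non-missing value is an int or float (bools excluded)."""
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in _present(values)
    )


def check(requirement, values):
    """Step (ii): return True if the column satisfies the requirement."""
    if requirement == "no_missing":
        return all(v is not None for v in values)
    if requirement == "numeric":
        return _is_numeric(values)
    if requirement == "scaled":
        present = _present(values)
        if not present or not _is_numeric(values):
            return True  # the scaling check only applies to numeric columns
        # Assumed tolerance: mean near 0 and standard deviation near 1.
        return abs(mean(present)) < 0.5 and abs(pstdev(present) - 1.0) < 0.5
    raise ValueError(f"unknown requirement: {requirement}")


def recommend(model, data):
    """Steps (i) to (iii): map each offending variable to suggested techniques."""
    requirements = MODEL_REQUIREMENTS[model]  # step (i)
    report = {}
    for column, values in data.items():
        unmet = [r for r in requirements if not check(r, values)]  # step (ii)
        if unmet:
            report[column] = [SUGGESTED_TECHNIQUE[r] for r in unmet]  # step (iii)
    return report


# Toy dataset: column name -> list of values.
example = {
    "age": [23, 35, None, 51],
    "city": ["Alicante", "Madrid", "Alicante", "Elche"],
    "z_score": [-1.0, 0.0, 1.0, 0.0],
}

for model_name in MODEL_REQUIREMENTS:
    print(model_name, recommend(model_name, example))
```

Keeping the requirements in a table keyed by model is one way to let a developer switch models and rerun the same check, which matches the abstract's goal of experimenting freely without overlooking a needed technique.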
Pages: 34-42
Number of pages: 9