A Data-Driven Methodology for Guiding the Selection of Preprocessing Techniques in a Machine Learning Pipeline

Cited by: 0
Authors
Garcia-Carrasco, Jorge [1]
Mate, Alejandro [1]
Trujillo, Juan [1]
Affiliations
[1] Univ Alicante, Lucentia Res Grp, Dept Software & Comp Syst, Ctra San Vicente del Raspeig S-N, San Vicente Del Raspeig 03690, Spain
Keywords
Data-driven; Preprocessing; Methodology; Data Science; IMPACT;
DOI
10.1007/978-3-031-34674-3_5
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The performance of a Machine Learning (ML) model depends greatly on how the data has been preprocessed. Unfortunately, the decision on which preprocessing techniques should be applied relies on the expertise of data scientists and/or ML practitioners. Since the correct application of some techniques depends on the characteristics of the data, whereas that of others depends on the particular ML model to be trained, this leads to an error-prone process that requires the data scientist to be knowledgeable about all the combinations that may arise. To tackle this problem, we propose a methodology that guides the selection of the most appropriate preprocessing techniques, namely those that are highly required or strongly recommended, taking into account both the ML model and the data characteristics, so that the developer is able to freely experiment with different models while ensuring that no needed preprocessing techniques are overlooked. According to the ML model and the data at hand, the methodology will (i) obtain the characteristics of the model, (ii) check whether these characteristics are met by the data or not, and (iii) show the developer which variables require preprocessing and which techniques should be applied, so that a proper decision can be made. To the best of our knowledge, this is the only work that tries to gather the most common ML models together with their most adequate preprocessing techniques and encode this information into a methodology that guides this process in a systematic way.
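The three steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the catalog `MODEL_REQUIREMENTS`, the mapping `SUGGESTED_TECHNIQUE`, the check functions, and the `recommend` helper are all hypothetical names and illustrative assumptions, since the record does not list the actual models, characteristics, or techniques covered by the paper.

```python
from statistics import mean, pstdev

# (i) Hypothetical catalog of what each model expects from the data.
# Illustrative assumption only; not the catalog defined in the paper.
MODEL_REQUIREMENTS = {
    "knn": ["no_missing", "numeric", "scaled"],
    "decision_tree": ["no_missing", "numeric"],
}

# Hypothetical mapping from an unmet requirement to a suggested technique.
SUGGESTED_TECHNIQUE = {
    "no_missing": "impute missing values",
    "numeric": "encode categorical values",
    "scaled": "standardize",
}


def _present(values):
    """Return the non-missing entries (missing values are None here)."""
    return [v for v in values if v is not None]


def _is_number(v):
    return isinstance(v, (int, float)) and not isinstance(v, bool)


def _no_missing(col):
    return all(v is not None for v in col)


def _numeric(col):
    return all(_is_number(v) for v in _present(col))


def _scaled(col, tol=0.1):
    """Roughly zero mean and unit standard deviation.

    Non-numeric columns are left to the 'numeric' check and pass here.
    """
    values = _present(col)
    if not values or not all(_is_number(v) for v in values):
        return True
    return abs(mean(values)) <= tol and abs(pstdev(values) - 1.0) <= tol


# (ii) One check per data characteristic.
CHECKS = {"no_missing": _no_missing, "numeric": _numeric, "scaled": _scaled}


def recommend(model, data):
    """(iii) Map each variable to the techniques it still needs for `model`."""
    report = {}
    for name, col in data.items():
        needed = [
            SUGGESTED_TECHNIQUE[req]
            for req in MODEL_REQUIREMENTS[model]
            if not CHECKS[req](col)
        ]
        if needed:
            report[name] = needed
    return report
```

Calling `recommend("knn", data)` with `data` given as a dictionary of column name to list of values returns only the variables that fail at least one check, so the final decision stays with the developer, as the abstract describes.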
Pages: 34 - 42
Number of pages: 9
Related papers
50 in total
  • [41] Chinese diabetes datasets for data-driven machine learning
    Zhao, Qinpei
    Zhu, Jinhao
    Shen, Xuan
    Lin, Chuwen
    Zhang, Yinjia
    Liang, Yuxiang
    Cao, Baige
    Li, Jiangfeng
    Liu, Xiang
    Rao, Weixiong
    Wang, Congrong
    SCIENTIFIC DATA, 2023, 10 (01)
  • [42] Unsupervised machine learning for data-driven representations of reactions
    Sirumalla, Sai Krishna
    West, Richard
    ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY, 2018, 256
  • [43] Anomaly analytics in data-driven machine learning applications
    Azimi, Shelernaz
    Pahl, Claus
    INTERNATIONAL JOURNAL OF DATA SCIENCE AND ANALYTICS, 2025, 19 (01) : 155 - 180
  • [44] Machine Learning Descriptors for Data-Driven Catalysis Study
    Mou, Li-Hui
    Han, TianTian
    Smith, Pieter E. S.
    Sharman, Edward
    Jiang, Jun
    ADVANCED SCIENCE, 2023, 10 (22)
  • [45] Chinese diabetes datasets for data-driven machine learning
    Qinpei Zhao
    Jinhao Zhu
    Xuan Shen
    Chuwen Lin
    Yinjia Zhang
    Yuxiang Liang
    Baige Cao
    Jiangfeng Li
    Xiang Liu
    Weixiong Rao
    Congrong Wang
    Scientific Data, 10
  • [46] Machine Learning for Data-Driven Discovery: The Rise and Relevance
    Sengupta, Partho P.
    Shrestha, Sirish
    JACC-CARDIOVASCULAR IMAGING, 2019, 12 (04) : 690 - 692
  • [47] Interpretable Data-Driven Modeling in Biomass Preprocessing
    Marino, Daniel L.
    Anderson, Matthew
    Kenney, Kevin
    Manic, Milos
    2018 11TH INTERNATIONAL CONFERENCE ON HUMAN SYSTEM INTERACTION (HSI), 2018, : 291 - 297
  • [48] Constructing Dependable Data-Driven Software With Machine Learning
    Pahl, Claus
    Azimi, Shelernaz
    IEEE SOFTWARE, 2021, 38 (06) : 88 - 97
  • [49] The rise of data-driven microscopy powered by machine learning
    Morgado, Leonor
    Gomez-de-Mariscal, Estibaliz
    Heil, Hannah S.
    Henriques, Ricardo
    JOURNAL OF MICROSCOPY, 2024, 295 (02) : 85 - 92
  • [50] Data preprocessing techniques: emergence and selection towards machine learning models - a practical review using HPA dataset
    K Mallikharjuna Rao
    Ghanta Saikrishna
    Kundrapu Supriya
    Multimedia Tools and Applications, 2023, 82 : 37177 - 37196