Imbalanced generative sampling of training data for improving quality of machine learning model

Authors
Coskun, Umut Can [1]
Dogan, Kemal Mert [2]
Gunpinar, Erkan [3]
Affiliations
[1] Numedyne Informat & Engn Inc, Istanbul, Turkiye
[2] Yildiz Tech Univ, TR-34210 Istanbul, Turkiye
[3] Istanbul Tech Univ, Istanbul, Turkiye
Keywords
Imbalanced sampling; Machine learning; Computer-aided design; Design exploration; Training data; Computational fluid dynamics; DESIGN; OPTIMIZATION; PERFORMANCE; UNCERTAINTY; ALGORITHM; SYSTEM
DOI
10.1016/j.aei.2024.102631
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Design exploration in engineering applications often requires a meticulous experimental or numerical study to evaluate the performance (Y) of each design, which may demand great effort, time or resources. Reducing the number of these tests needed to find a good design is of paramount importance in all engineering fields. This study aims to build a machine learning (ML) model using fewer designs as training data. Uniform sampling (US) in the design space (based on predefined design parameters) is a promising approach to obtaining training data. We extend this sampling concept by also employing the ML model to select designs in the design space. The designs are selected via two non-uniform (imbalanced) sampling methods, height-based sampling (HBS) and gradient-based sampling (GBS), which consider the designs' Y values and gradient values, dY, respectively. These values are divided into uniform intervals, and we aim to equalize, as far as possible, the number of training designs in each interval. This forces the training data to include designs with minimum or maximum Y or dY values, which generally lie in only a small portion of the design space. Designs can therefore be captured from all portions of the design space. Results of the proposed methods are compared against US and two well-studied non-uniform sampling strategies, Stratified Over Sampling (SOS) and Gaussian-Process Based Sampling (GPBS). To reliably investigate the quality of ML models obtained using designs sampled via US, SOS, GPBS, HBS and GBS, we use standard test functions with known forms (such as Easom and Beale) as substitutes for engineering problems. According to the results presented, ML models using HBS and GBS have either better prediction accuracy or wider applicability than all other tested sampling methods.
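The interval-equalizing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `balanced_interval_sample` and `gradient_norm`, the candidate pool size, the number of intervals, and the round-robin selection rule are all assumptions made for illustration. For brevity the Easom function is evaluated directly on the candidates, whereas the abstract indicates that an ML model is employed when selecting designs.

```python
import numpy as np

def easom(x1, x2):
    """Easom test function: close to 0 almost everywhere, narrow well at (pi, pi)."""
    return -np.cos(x1) * np.cos(x2) * np.exp(-((x1 - np.pi) ** 2 + (x2 - np.pi) ** 2))

def gradient_norm(f, pts, h=1e-5):
    """Central-difference estimate of the gradient magnitude at each row of pts."""
    dx = (f(pts[:, 0] + h, pts[:, 1]) - f(pts[:, 0] - h, pts[:, 1])) / (2 * h)
    dy = (f(pts[:, 0], pts[:, 1] + h) - f(pts[:, 0], pts[:, 1] - h)) / (2 * h)
    return np.hypot(dx, dy)

def balanced_interval_sample(values, n_samples, n_intervals=10, rng=None):
    """Pick indices so that uniform intervals of `values` are filled as evenly as possible.

    Hypothetical helper: the value range is split into `n_intervals` equal-width
    intervals and candidates are drawn from the intervals in round-robin order.
    """
    rng = np.random.default_rng(rng)
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_intervals + 1)
    # Use interior edges only, so the maximum value lands in the last interval.
    bin_ids = np.digitize(values, edges[1:-1])
    # One shuffled pool of candidate indices per interval.
    pools = [list(rng.permutation(np.flatnonzero(bin_ids == b)))
             for b in range(n_intervals)]
    chosen = []
    # Take one candidate from each non-empty interval in turn until done.
    while len(chosen) < n_samples and any(pools):
        for pool in pools:
            if pool and len(chosen) < n_samples:
                chosen.append(pool.pop())
    return np.array(chosen)

rng = np.random.default_rng(0)
candidates = rng.uniform(-10, 10, size=(20000, 2))   # assumed candidate designs
y = easom(candidates[:, 0], candidates[:, 1])        # stand-in for Y values

# Plain uniform sampling (US) for comparison.
uniform_idx = rng.choice(len(candidates), size=50, replace=False)
# Height-style selection: balance the intervals of Y.
hbs_idx = balanced_interval_sample(y, 50, rng=rng)
# Gradient-style selection: balance the intervals of the gradient magnitude.
dy = gradient_norm(easom, candidates)
gbs_idx = balanced_interval_sample(dy, 50, rng=rng)
```

Because the Easom well occupies a small part of the domain, a uniform draw of 50 designs tends to land almost entirely on the flat region, while the interval-balanced draws deliberately include designs from the sparsely populated intervals.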
Pages: 14