Fast and simple dataset selection for machine learning

Cited by: 5
Authors
Peter, Timm J. [1]
Nelles, Oliver [1]
Affiliations
[1] Univ Siegen, Inst Mechan & Regelungstech Mechatron, Dept Maschinenbau, Paul Bonatz Str 9-11, D-57068 Siegen, Germany
Keywords
machine learning; dataset selection; design of experiments; space-filling design; domain adaptation
DOI
10.1515/auto-2019-0010
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
The task of data reduction is discussed, and a novel selection approach is proposed that allows control over the optimal point distribution of the selected data subset. The proposed approach relies on the estimation of probability density functions (pdfs). Owing to its structure, the new method can select a subset either by approximating the pdf of the original dataset or by approximating an arbitrary, desired target pdf. The new strategy evaluates the estimated pdfs solely on the selected data points, resulting in a simple and efficient algorithm with low computational and memory demand. The performance of the new approach is investigated in two scenarios. For representative subset selection from a dataset, the new approach is compared with a recently proposed, more complex method and shows comparable results. To demonstrate the capability of matching a target pdf, a uniform distribution is chosen as an example. Here the new method is compared with strategies for space-filling design of experiments and shows convincing results.
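The idea of choosing a subset whose estimated pdf approximates either the dataset's own pdf or a desired target pdf can be illustrated with a small greedy sketch. This is not the authors' algorithm: the abstract states that their method evaluates the estimated pdfs only on the selected points, whereas this illustration evaluates a Gaussian kernel density estimate at every candidate point for simplicity. The function names `gaussian_kernel` and `greedy_select`, the fixed `bandwidth`, and the "largest density deficit" rule are all assumptions made for the example.

```python
import numpy as np


def gaussian_kernel(points, centers, bandwidth):
    """Isotropic Gaussian kernel values; rows index points, columns index centers."""
    diff = points[:, None, :] - centers[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    dim = points.shape[1]
    norm = (2.0 * np.pi * bandwidth ** 2) ** (dim / 2.0)
    return np.exp(-0.5 * sq_dist / bandwidth ** 2) / norm


def greedy_select(data, n_select, bandwidth, target_pdf=None):
    """Greedily pick n_select row indices of data (hypothetical helper).

    target_pdf=None    -> the target is a KDE of the full dataset
                          (representative subset selection).
    target_pdf=callable -> the target is target_pdf(data), one value per row
                          (for example a constant for a uniform target).
    """
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    if target_pdf is None:
        # O(n^2) memory: acceptable for a sketch, not for large datasets.
        target = gaussian_kernel(data, data, bandwidth).mean(axis=1)
    else:
        target = np.asarray(target_pdf(data), dtype=float)

    kernel_sum = np.zeros(n)            # running sum of kernels of selected points
    available = np.ones(n, dtype=bool)  # candidates not yet selected
    selected = []
    for k in range(n_select):
        subset_pdf = kernel_sum / max(k, 1)  # KDE of the current subset
        # Pick the candidate where the subset density falls furthest below target.
        deficit = np.where(available, target - subset_pdf, -np.inf)
        idx = int(np.argmax(deficit))
        selected.append(idx)
        available[idx] = False
        kernel_sum += gaussian_kernel(data, data[idx:idx + 1], bandwidth)[:, 0]
    return np.array(selected)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(500, 2))
    representative = greedy_select(data, 50, bandwidth=0.3)
    space_filling = greedy_select(
        data, 50, bandwidth=0.3, target_pdf=lambda x: np.ones(len(x))
    )
    print(representative[:5], space_filling[:5])
```

With a constant `target_pdf`, the rule reduces to repeatedly picking the candidate in the currently least-covered region, which tends to spread the subset over the data's support. The bandwidth strongly affects the result and would need tuning in practice.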
Pages: 833-842
Number of pages: 10