Selective Sampling Designs to Improve the Performance of Classification Methods

Cited by: 1
Authors
Ghorbani, Soroosh [1]
Desmarais, Michel C. [1]
Affiliations
[1] Comp & Software Engn Dept, Montreal, PQ, Canada
Keywords
Planned Missing Data Design; Selective Sampling; Classification;
DOI
10.1109/ICMLA.2013.187
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Selective Sampling design refers to the situation where a study has a fixed number of observations but can allocate them unevenly among the variables during the data gathering phase, so that some variables have a greater ratio of missing values than others. In particular, we can allocate more or fewer missing values to uncertain variables: a variable's relative frequency is either closer to 50% (higher uncertainty) or further from 50% (lower uncertainty). The main objective of the study is to investigate how a Selective Sampling process helps improve the performance of classification methods. This study specifically asks: "Can Selective Sampling affect the performance of the classification methods?" We focus on three classification models for binary datasets: Naive Bayes, Logistic Regression, and Tree Augmented Naive Bayes (TAN). Three sampling schemes are defined: (1) Uniform (random samples) as a baseline, (2) Most Uncertain (higher sampling rate of uncertain items), and (3) Least Uncertain (lower sampling rate of uncertain items). We investigate the impact of these schemes on the performance of the three models on 11 different datasets. The results from 100-fold cross-validation show that Selective Sampling improves the prediction performance of the TAN model in all of the datasets and, in more than half of the datasets (54.6%), brings a higher prediction performance to the Naive Bayes and Logistic Regression classifiers.
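The three schemes named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: the abstract does not state how uncertainty is scored or how rates are assigned, so the uncertainty score `1 - 2*|p - 0.5|`, the linear weights `0.5 + u` and `1.5 - u`, and the function names `sampling_rates` and `observation_mask` are all assumptions made for illustration. The sketch only shows the general idea of keeping the overall observation budget fixed while shifting it toward more or less uncertain binary variables.

```python
import numpy as np

def sampling_rates(p, scheme, mean_rate=0.5):
    """Per-variable observation rates under a fixed average budget.

    p         : relative frequency of the value 1 for each binary variable
    scheme    : "uniform", "most_uncertain", or "least_uncertain"
    mean_rate : average fraction of values observed per variable
    """
    p = np.asarray(p, dtype=float)
    # Assumed uncertainty score: 1 when p == 0.5, 0 when p is 0 or 1.
    u = 1.0 - 2.0 * np.abs(p - 0.5)
    if scheme == "uniform":
        w = np.ones_like(p)
    elif scheme == "most_uncertain":
        w = 0.5 + u          # hypothetical weighting: favor uncertain variables
    elif scheme == "least_uncertain":
        w = 1.5 - u          # hypothetical weighting: favor certain variables
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    # Rescale so the mean rate matches the budget before clipping.
    rates = mean_rate * w / w.mean()
    # Clip to valid probabilities; if clipping triggers, the realized
    # mean rate falls below mean_rate.
    return np.clip(rates, 0.0, 1.0)

def observation_mask(n_rows, rates, rng):
    """Boolean mask: True where a value is observed, False where missing."""
    return rng.random((n_rows, len(rates))) < rates

if __name__ == "__main__":
    freqs = [0.5, 0.9, 0.1, 0.7]
    rng = np.random.default_rng(0)
    for scheme in ("uniform", "most_uncertain", "least_uncertain"):
        rates = sampling_rates(freqs, scheme)
        mask = observation_mask(1000, rates, rng)
        print(scheme, rates, mask.mean(axis=0))
```

Under these assumed weights, the variable with frequency 0.5 is observed most often in the Most Uncertain scheme and least often in the Least Uncertain scheme, while the average rate across variables stays at `mean_rate`.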
Pages: 178-181
Number of pages: 4