Identifying and handling mislabelled instances

Cited by: 113
Authors
Muhlenbach, F [1]
Lallich, S [1]
Zighed, DA [1]
Affiliations
[1] Univ Lyon 2, ERIC Lab, F-69676 Bron, France
Keywords
supervised learning; mislabelled data; geometrical neighbourhood; filtering; removing instances; relabelling instances
DOI
10.1023/A:1025832930864
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Data mining and knowledge discovery aim at producing useful and reliable models from the data. Unfortunately some databases contain noisy data which perturb the generalization of the models. An important source of noise consists of mislabelled training instances. We offer a new approach which deals with improving classification accuracies by using a preliminary filtering procedure. An example is suspect when in its neighbourhood defined by a geometrical graph the proportion of examples of the same class is not significantly greater than in the database itself. Such suspect examples in the training data can be removed or relabelled. The filtered training set is then provided as input to learning algorithms. Our experiments on ten benchmarks of UCI Machine Learning Repository using 1-NN as the final algorithm show that removal gives better results than relabelling. Removing allows maintaining the generalization error rate when we introduce from 0 to 20% of noise on the class, especially when classes are well separable. The filtering method proposed is finally compared to the relaxation relabelling schema.
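The filtering idea in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' procedure: it swaps in a plain k-nearest-neighbour neighbourhood for the geometrical graph, and a one-sided normal-approximation test for the significance test, neither of which is specified in the abstract. The function name `find_suspects` and the parameters `k` and `z_crit` are hypothetical.

```python
import numpy as np

def find_suspects(X, y, k=5, z_crit=1.645):
    """Flag instances whose neighbourhood is not significantly purer than the data set.

    Assumptions (not from the abstract): k-NN neighbourhood instead of a
    geometrical graph, a normal approximation to the binomial test, and
    at least two classes present in y.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = len(y)

    # Pairwise squared Euclidean distances; exclude each point from its own neighbourhood.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)
    nbrs = np.argsort(d2, axis=1)[:, :k]

    # Proportion of neighbours sharing the instance's label.
    same = (y[nbrs] == y[:, None]).mean(axis=1)

    # Proportion of the instance's class in the whole data set.
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes.tolist(), counts / n))
    p0 = np.array([prior[c] for c in y.tolist()])

    # One-sided z statistic: is the local proportion significantly above the global one?
    z = (same - p0) / np.sqrt(p0 * (1.0 - p0) / k)
    return z <= z_crit  # True means "suspect"

# Toy data: two well-separated clusters, with one label deliberately flipped.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (10, 2)), rng.normal(10.0, 0.5, (10, 2))])
y = np.array([0] * 10 + [1] * 10)
y[0] = 1  # mislabelled instance inside the class-0 cluster

suspects = find_suspects(X, y)
X_filtered, y_filtered = X[~suspects], y[~suspects]  # the "removal" option
```

With such a small `k` the test is strict, so correctly labelled neighbours of a mislabelled point may also be flagged; this sketch shows only the removal option, which is the one the abstract reports as performing better than relabelling.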
Pages: 89-109
Page count: 21
Related papers
50 in total
  • [1] Identifying and Handling Mislabelled Instances
    Fabrice Muhlenbach
    Stéphane Lallich
    Djamel A. Zighed
    Journal of Intelligent Information Systems, 2004, 22 : 89 - 109
  • [2] Handling interlinked XML instances on the Web
    Behrends, E
    Fritzen, O
    May, W
    ADVANCES IN DATABASE TECHNOLOGY - EDBT 2006, 2006, 3896 : 792 - 810
  • [3] Identifying mislabelled samples: Machine learning models exceed human performance
    Farrell, Christopher-John
    ANNALS OF CLINICAL BIOCHEMISTRY, 2021, 58 (06) : 650 - 652
  • [4] Identifying and eliminating mislabeled training instances
    Brodley, CE
    Friedl, MA
    PROCEEDINGS OF THE THIRTEENTH NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND THE EIGHTH INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE, VOLS 1 AND 2, 1996, : 799 - 805
  • [5] Identifying Mislabeled Instances in Classification Datasets
    Mueller, Nicolas M.
    Markert, Karla
    2019 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2019,
  • [6] Identifying and correcting mislabeled training instances
    Sun, Jiang-Wen
    Zhao, Feng-Ying
    Wang, Chong-Jun
    Chen, Shi-Fu
    PROCEEDINGS OF FUTURE GENERATION COMMUNICATION AND NETWORKING, MAIN CONFERENCE PAPERS, VOL 1, 2007, : 243 - 249
  • [7] Identifying Interesting Instances for Probabilistic Skylines
    Qi, Yinian
    Atallah, Mikhail
    DATABASE AND EXPERT SYSTEMS APPLICATIONS, PT 2, 2010, 6262 : 300 - 314
  • [8] Identifying Unknown Instances for Autonomous Driving
    Wong, Kelvin
    Wang, Shenlong
    Ren, Mengye
    Liang, Ming
    Urtasun, Raquel
    CONFERENCE ON ROBOT LEARNING, VOL 100, 2019, 100
  • [9] Handling time-varying TSP instances
    de Franca, Fabricio O.
    Gomes, Lalinka C. T.
    de Castro, Leandro N.
    Von Zuben, Fernando J.
    2006 IEEE CONGRESS ON EVOLUTIONARY COMPUTATION, VOLS 1-6, 2006, : 2815 - +
  • [10] FDA MISLABELLED
    Unknown
    ECONOMIST, 1965, 217 (06): : 615 - 615