Outlier detection for partially labeled categorical data based on conditional information entropy

Cited by: 3
Authors
Zhao, Zhengwei [1]
Wang, Rongrong [2]
Huang, Dan [3]
Li, Zhaowen [4]
Affiliations
[1] Guangxi Minzu Univ, Sch Math & Phys, Nanning 530006, Guangxi, Peoples R China
[2] Guangxi Minzu Univ, Elect & Informat Engn, Nanning 530000, Guangxi, Peoples R China
[3] Yulin Normal Univ, Sch Comp Sci & Engn, Yulin 537000, Guangxi, Peoples R China
[4] Putian Univ, Key Lab Appl Math Fujian Prov Univ, Fujian Key Lab Financial Informat Proc, Putian 351100, Fujian, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Partially labeled categorical data; Partially labeled categorical decision information system; Outlier detection; Conditional information entropy; ALGORITHMS; CLUSTERS
DOI
10.1016/j.ijar.2023.109086
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Labeling large amounts of data is costly and often infeasible in practice, so available data may have missing labels. This article investigates outlier detection for partially labeled categorical data based on conditional information entropy. First, equivalence classes in a partially labeled categorical decision information system (p-CDIS) are introduced so that missing labels can be predicted using conditional probability. Then, the conditional information entropy in a p-CDIS is calculated, which provides a more comprehensive measure of uncertainty. In addition, the relative information entropy and relative cardinality in a p-CDIS are proposed. Next, the degree of outlierness and a weight function are presented to find outlier factors. Finally, an outlier detection method in a p-CDIS based on conditional information entropy is proposed, and a corresponding conditional information entropy-based algorithm (CEOF) is designed. To evaluate the stability of the CEOF algorithm, experiments are performed on ten datasets from the UCI Machine Learning Repository. Compared with five other algorithms, the proposed method shows good effectiveness and adaptability for categorical data.
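The first steps named in the abstract (equivalence classes, label prediction by conditional probability, conditional entropy) can be illustrated with a minimal Python sketch. This is not the CEOF algorithm and does not reproduce the paper's definitions: it assumes the textbook Shannon conditional entropy and a simple majority rule within each equivalence class, and every function name and the toy data are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def equivalence_classes(rows, attrs):
    """Group row indices that agree on every attribute in attrs."""
    classes = defaultdict(list)
    for i, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].append(i)
    return list(classes.values())


def predict_missing_labels(rows, labels, attrs):
    """Fill None labels with the most frequent known label in the same class.

    Assumption: the most frequent label maximizes the empirical conditional
    probability P(label | class). Ties go to the label seen first.
    """
    filled = list(labels)
    for block in equivalence_classes(rows, attrs):
        counts = Counter(labels[i] for i in block if labels[i] is not None)
        if not counts:
            continue  # no labeled object in this class; leave labels as None
        best = counts.most_common(1)[0][0]
        for i in block:
            if filled[i] is None:
                filled[i] = best
    return filled


def conditional_entropy(rows, labels, attrs):
    """Shannon H(label | attrs) in bits, using only objects with known labels."""
    n_known = sum(1 for y in labels if y is not None)
    if n_known == 0:
        return 0.0
    total = 0.0
    for block in equivalence_classes(rows, attrs):
        ys = [labels[i] for i in block if labels[i] is not None]
        if not ys:
            continue
        counts = Counter(ys)
        h = -sum((c / len(ys)) * log2(c / len(ys)) for c in counts.values())
        total += (len(ys) / n_known) * h
    return total


# Hypothetical toy data: five objects, two categorical attributes.
rows = [
    {"color": "red", "shape": "round"},
    {"color": "red", "shape": "round"},
    {"color": "red", "shape": "round"},
    {"color": "blue", "shape": "square"},
    {"color": "blue", "shape": "square"},
]
labels = ["a", "a", None, "b", None]
attrs = ["color", "shape"]

filled = predict_missing_labels(rows, labels, attrs)
entropy = conditional_entropy(rows, filled, attrs)
```

In this toy case each equivalence class ends up with a single label, so the conditional entropy is zero; classes with mixed labels would contribute positive uncertainty in proportion to their size.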
Pages: 25
Related papers
50 in total
  • [41] HOT: Hypergraph-based outlier test for categorical data
    Wei, L
    Qian, WN
    Zhou, AY
    Jin, W
    Yu, JX
    ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, 2003, 2637 : 399 - 410
  • [42] Outlier detection based on multisource information fusion in incomplete mixed data
    Li, Ran
    Chen, Hongchang
    Liu, Shuxin
    Wang, Kai
    Liu, Shuo
    Su, Zhe
    APPLIED SOFT COMPUTING, 2024, 165
  • [43] FAST-ODT: A Lightweight Outlier Detection Scheme for Categorical Data Sets
    Du, Hongwei
    Ye, Qiang
    Sun, Zhipeng
    Liu, Chuang
    Xu, Wen
    IEEE TRANSACTIONS ON NETWORK SCIENCE AND ENGINEERING, 2021, 8 (01): : 13 - 24
  • [44] Feature selection considering interaction, redundancy and complementarity for outlier detection in categorical data
    Wang, Lianxi
    Ke, Yubing
    KNOWLEDGE-BASED SYSTEMS, 2023, 275
  • [45] Entropy-based outlier detection using spark
    Feng, Guilan
    Li, Zhengnan
    Zhou, Wengang
    Dong, Shi
    CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2020, 23 (02): : 409 - 419
  • [46] Entropy-based outlier detection using spark
    Guilan Feng
    Zhengnan Li
    Wengang Zhou
    Shi Dong
    Cluster Computing, 2020, 23 : 409 - 419
  • [47] ROBOUT: a conditional outlier detection methodology for high-dimensional data
    Farne, Matteo
    Vouldis, Angelos
    STATISTICAL PAPERS, 2024, 65 (04) : 2489 - 2525
  • [48] Outlier Detection Based on the Data Structure
    Guo, Feng
    Shi, Canghong
    Li, Xiaojie
    He, Jia
    Wu, Xi
    2018 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2018,
  • [49] DDoS Detection and Prevention Based on Joint Entropy and Conditional Entropy
    Gu Yonghao
    Wu Weiming
    ADVANCED MATERIALS AND COMPUTER SCIENCE, PTS 1-3, 2011, 474-476 : 2129 - 2133
  • [50] Information granularity-based incremental feature selection for partially labeled hybrid data
    Shu, Wenhao
    Yan, Zhenchao
    Chen, Ting
    Yu, Jianhui
    Qian, Wenbin
    INTELLIGENT DATA ANALYSIS, 2022, 26 (01) : 33 - 56