Efficient outlier detection in numerical and categorical data

Cited by: 0
Authors
Cabral, Eugenio F. [1 ]
Vinces, Braulio V. Sanchez
Silva, Guilherme D. F. [1 ]
Sander, Jorg [2 ]
Cordeiro, Robson L. F. [3 ]
Affiliations
[1] Univ Sao Paulo, Dept Comp Sci, Ave Trabalhador Sao Carlense 400, BR-13566590 Sao Carlos, SP, Brazil
[2] Univ Alberta, Dept Comp Sci, Edmonton, AB T6G 2R3, Canada
[3] Carnegie Mellon Univ, Sch Comp Sci, Pittsburgh, PA 15213 USA
Funding
Andrew W. Mellon Foundation (USA); São Paulo Research Foundation (Brazil)
Keywords
Outlier detection; Scalability; Numerical and categorical data
DOI
10.1007/s10618-024-01084-1
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
How to spot outliers in a large, unlabeled dataset with both numerical and categorical attributes? How to do it in a fast and scalable way? Outlier detection has many applications; it is therefore covered by an extensive literature. The distance-based detectors are the most popular ones. However, they still have two major drawbacks: (a) the intensive neighborhood search that takes hours or even days to complete in large data; and (b) the inability to process categorical attributes. This paper tackles both problems by presenting HySortOD: a new, fast and scalable detector for numerical and categorical data. Our main focus is the analysis of datasets with many instances and a low-to-moderate number of attributes. We studied dozens of real, benchmark datasets with up to one million instances; HySortOD outperformed nine competitors from the state of the art in runtime, being up to six orders of magnitude faster in large data, while maintaining high accuracy. Finally, we also performed an extensive experimental evaluation that confirms the ability of our method to obtain high-quality results from both real and synthetic datasets with categorical attributes.
Pages: 46