Efficient outlier detection in numerical and categorical data

Authors
Cabral, Eugenio F. [1 ]
Vinces, Braulio V. Sanchez
Silva, Guilherme D. F. [1 ]
Sander, Jorg [2 ]
Cordeiro, Robson L. F. [3 ]
Affiliations
[1] Univ Sao Paulo, Dept Comp Sci, Ave Trabalhador Sao Carlense 400, BR-13566590 Sao Carlos, SP, Brazil
[2] Univ Alberta, Dept Comp Sci, Edmonton, AB T6G 2R3, Canada
[3] Carnegie Mellon Univ, Sch Comp Sci, Pittsburgh, PA 15213 USA
Funding
Andrew W. Mellon Foundation (USA); São Paulo Research Foundation (Brazil);
Keywords
Outlier detection; Scalability; Numerical and categorical data;
DOI
10.1007/s10618-024-01084-1
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
How to spot outliers in a large, unlabeled dataset with both numerical and categorical attributes? How to do it in a fast and scalable way? Outlier detection has many applications; it is therefore covered by an extensive literature. Distance-based detectors are the most popular ones. However, they still have two major drawbacks: (a) the intensive neighborhood search, which takes hours or even days to complete on large data; and (b) the inability to process categorical attributes. This paper tackles both problems by presenting HySortOD: a new, fast and scalable detector for numerical and categorical data. Our main focus is the analysis of datasets with many instances and a low-to-moderate number of attributes. We studied dozens of real benchmark datasets with up to one million instances; HySortOD outperformed nine state-of-the-art competitors in runtime, being up to six orders of magnitude faster on large data, while maintaining high accuracy. Finally, we also performed an extensive experimental evaluation that confirms the ability of our method to obtain high-quality results from both real and synthetic datasets with categorical attributes.
Pages: 46
Related papers
50 in total
  • [21] Information-Theoretic Outlier Detection for Large-Scale Categorical Data
    Wu, Shu
    Wang, Shengrui
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2013, 25 (03) : 589 - 602
  • [22] Outlier Analysis of Categorical Data using FuzzyAVF
    Reddy, Lakshmi Sreenivasa D.
    Babu, B. Raveendra
    PROCEEDINGS OF 2013 INTERNATIONAL CONFERENCE ON CIRCUITS, POWER AND COMPUTING TECHNOLOGIES (ICCPCT 2013), 2013, : 1259 - 1263
  • [23] Outlier analysis of categorical data using FuzzyAVF
    1600, IEEE Computer Society
  • [24] Efficient Outlier Detection for High-Dimensional Data
    Liu, Huawen
    Li, Xuelong
    Li, Jiuyong
    Zhang, Shichao
    IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS, 2018, 48 (12) : 2451 - 2461
  • [25] An Efficient Approach for Outlier Detection with Imperfect Data Labels
    Liu, Bo
    Xiao, Yanshan
    Yu, Philip S.
    Hao, Zhifeng
    Cao, Longbing
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2014, 26 (07) : 1602 - 1616
  • [26] A Fast and Efficient Local Outlier Detection in Data Streams
    Yang, Xing
    Zhou, Wenli
    Shu, Nanfei
    Zhang, Hao
    PROCEEDINGS OF 2019 INTERNATIONAL CONFERENCE ON IMAGE, VIDEO AND SIGNAL PROCESSING (IVSP 2019), 2019, : 111 - 116
  • [27] A scalable and efficient outlier detection strategy for categorical data
    Koufakou, A.
    Ortiz, E. G.
    Georgiopoulos, M.
    Anagnostopoulos, G. C.
    Reynolds, K. M.
    19TH IEEE INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE, VOL II, PROCEEDINGS, 2007, : 210 - +
  • [28] Weighted Outlier Detection of High-Dimensional Categorical Data Using Feature Grouping
    Li, Junli
    Zhang, Jifu
    Pang, Ning
    Qin, Xiao
    IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS, 2020, 50 (11) : 4295 - 4308
  • [29] Multi-Hierarchy Attribute Relationship Mining Based Outlier Detection for Categorical Data
    Hu, Xianyu
    Wang, Yijie
    Cheng, Li
    2019 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2019,
  • [30] Combine Value Clustering and Weighted Value Coupling Learning for Outlier Detection in Categorical Data
    Xu, Hongzuo
    Wang, Yongjun
    Wu, Zhiyue
    Ma, Xingkong
    Qin, Zhiquan
    DATABASE AND EXPERT SYSTEMS APPLICATIONS (DEXA 2018), PT II, 2018, 11030 : 439 - 449