Efficient outlier detection in numerical and categorical data

Cited by: 0
Authors
Cabral, Eugenio F. [1]
Vinces, Braulio V. Sanchez
Silva, Guilherme D. F. [1]
Sander, Jorg [2 ]
Cordeiro, Robson L. F. [3 ]
Affiliations
[1] Univ Sao Paulo, Dept Comp Sci, Ave Trabalhador Sao Carlense 400, BR-13566590 Sao Carlos, SP, Brazil
[2] Univ Alberta, Dept Comp Sci, Edmonton, AB T6G 2R3, Canada
[3] Carnegie Mellon Univ, Sch Comp Sci, Pittsburgh, PA 15213 USA
Funding
Andrew Mellon Foundation (USA); São Paulo Research Foundation (Brazil)
Keywords
Outlier detection; Scalability; Numerical and categorical data
DOI
10.1007/s10618-024-01084-1
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
How to spot outliers in a large, unlabeled dataset with both numerical and categorical attributes? How to do it in a fast and scalable way? Outlier detection has many applications; it is therefore covered by an extensive literature. Distance-based detectors are the most popular ones. However, they still have two major drawbacks: (a) the intensive neighborhood search, which takes hours or even days to complete on large data; and (b) the inability to process categorical attributes. This paper tackles both problems by presenting HySortOD: a new, fast and scalable detector for numerical and categorical data. Our main focus is the analysis of datasets with many instances and a low-to-moderate number of attributes. We studied dozens of real benchmark datasets with up to one million instances; HySortOD outperformed nine state-of-the-art competitors in runtime, being up to six orders of magnitude faster on large data while maintaining high accuracy. Finally, we also performed an extensive experimental evaluation that confirms the ability of our method to obtain high-quality results from both real and synthetic datasets with categorical attributes.
Pages: 46