Efficient outlier detection in numerical and categorical data

Cited by: 0

Authors:

Cabral, Eugenio F. [1]

Vinces, Braulio V. Sanchez

Silva, Guilherme D. F. [1]

Sander, Jorg [2]

Cordeiro, Robson L. F. [3]

Affiliations:

[1] Univ Sao Paulo, Dept Comp Sci, Ave Trabalhador Sao Carlense 400, BR-13566590 Sao Carlos, SP, Brazil

[2] Univ Alberta, Dept Comp Sci, Edmonton, AB T6G 2R3, Canada

[3] Carnegie Mellon Univ, Sch Comp Sci, Pittsburgh, PA 15213 USA

Source:

DATA MINING AND KNOWLEDGE DISCOVERY | 2025 / Vol. 39 / Issue 03

Funding:

Andrew Mellon Foundation (USA); São Paulo Research Foundation (Brazil);

Keywords:

Outlier detection; Scalability; Numerical and categorical data;

DOI:

10.1007/s10618-024-01084-1

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

How to spot outliers in a large, unlabeled dataset with both numerical and categorical attributes? How to do it in a fast and scalable way? Outlier detection has many applications and is therefore covered by an extensive literature. Distance-based detectors are the most popular ones. However, they still have two major drawbacks: (a) the intensive neighborhood search, which takes hours or even days to complete on large data; and (b) the inability to process categorical attributes. This paper tackles both problems by presenting HySortOD: a new, fast and scalable detector for numerical and categorical data. Our main focus is the analysis of datasets with many instances and a low-to-moderate number of attributes. We studied dozens of real benchmark datasets with up to one million instances; HySortOD outperformed nine state-of-the-art competitors in runtime, being up to six orders of magnitude faster on large data, while maintaining high accuracy. Finally, we also performed an extensive experimental evaluation that confirms the ability of our method to obtain high-quality results from both real and synthetic datasets with categorical attributes.


Pages: 46

