Efficient outlier detection in numerical and categorical data

Authors
Cabral, Eugenio F. [1]
Vinces, Braulio V. Sanchez
Silva, Guilherme D. F. [1]
Sander, Jorg [2]
Cordeiro, Robson L. F. [3]
Affiliations
[1] Univ Sao Paulo, Dept Comp Sci, Ave Trabalhador Sao Carlense 400, BR-13566590 Sao Carlos, SP, Brazil
[2] Univ Alberta, Dept Comp Sci, Edmonton, AB T6G 2R3, Canada
[3] Carnegie Mellon Univ, Sch Comp Sci, Pittsburgh, PA 15213 USA
Funding
Andrew Mellon Foundation (USA); São Paulo Research Foundation (Brazil)
Keywords
Outlier detection; Scalability; Numerical and categorical data
DOI
10.1007/s10618-024-01084-1
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
How to spot outliers in a large, unlabeled dataset with both numerical and categorical attributes? How to do it in a fast and scalable way? Outlier detection has many applications; it is therefore covered by an extensive literature. Distance-based detectors are the most popular ones. However, they still have two major drawbacks: (a) the intensive neighborhood search, which takes hours or even days to complete on large data; and (b) the inability to process categorical attributes. This paper tackles both problems by presenting HySortOD: a new, fast and scalable detector for numerical and categorical data. Our main focus is the analysis of datasets with many instances and a low-to-moderate number of attributes. We studied dozens of real benchmark datasets with up to one million instances; HySortOD outperformed nine state-of-the-art competitors in runtime, being up to six orders of magnitude faster on large data while maintaining high accuracy. Finally, we also performed an extensive experimental evaluation that confirms the ability of our method to obtain high-quality results from both real and synthetic datasets with categorical attributes.
Pages: 46
Related papers
50 in total
  • [31] Fast Memory Efficient Local Outlier Detection in Data Streams
    Salehi, Mahsa
    Leckie, Christopher
    Bezdek, James C.
    Vaithianathan, Tharshan
    Zhang, Xuyun
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2016, 28 (12) : 3246 - 3260
  • [32] MEOD: Memory-Efficient Outlier Detection on Streaming Data
    Karale, Ankita
    Lazarova, Milena
    Koleva, Pavlina
    Poulkov, Vladimir
    SYMMETRY-BASEL, 2021, 13 (03)
  • [33] A Fast and Efficient Algorithm for Outlier Detection Over Data Streams
    Hassaan, Mosab
    Maher, Hend
    Gouda, Karam
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2021, 12 (11) : 749 - 756
  • [34] GA-iForest: An Efficient Isolated Forest Framework Based on Genetic Algorithm for Numerical Data Outlier Detection
    LI Kexin
    LI Jing
    LIU Shuji
    LI Zhao
    BO Jue
    LIU Biqi
    Transactions of Nanjing University of Aeronautics and Astronautics, 2019, 36 (06) : 1026 - 1038
  • [35] Fast Parallel Outlier Detection for Categorical Datasets using MapReduce
    Koufakou, Anna
    Secretan, Jimmy
    Reeder, John
    Cardona, Kelvin
    Georgiopoulos, Michael
    2008 IEEE INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, VOLS 1-8, 2008, : 3298 - 3304
  • [36] DILOF: Effective and Memory Efficient Local Outlier Detection in Data Streams
    Na, Gyoung S.
    Kim, Donghyun
    Yu, Hwanjo
    KDD'18: PROCEEDINGS OF THE 24TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2018, : 1993 - 2002
  • [37] Efficient density and cluster based incremental outlier detection in data streams
    Degirmenci, Ali
    Karal, Omer
    INFORMATION SCIENCES, 2022, 607 : 901 - 920
  • [38] SynODC: Utilizing the Syntactic Structure for Outlier Detection in Categorical Attributes
    Zylinski, Arthur
    Qahtan, Abdulhakim A.
    MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES: RESEARCH TRACK, PT IV, ECML PKDD 2024, 2024, 14944 : 213 - 229
  • [39] An Energy-Efficient Outlier Detection Based on Data Clustering in WSNs
    Kim, Hongyeon
    Min, Jun-Ki
    INTERNATIONAL JOURNAL OF DISTRIBUTED SENSOR NETWORKS, 2014
  • [40] An efficient approach for outlier detection in big sensor data of health care
    Saneja, Bharti
    Rani, Rinkle
    INTERNATIONAL JOURNAL OF COMMUNICATION SYSTEMS, 2017, 30 (17)