Capabilities of outlier detection schemes in large datasets, framework and methodologies

Cited by: 50
Authors
Tang, Jian [1]
Chen, Zhixiang
Fu, Ada Waichee
Cheung, David W.
Affiliations
[1] Mem Univ Newfoundland, Dept Comp Sci, St John, NF A1C 5S7, Canada
[2] Chinese Univ Hong Kong, Dept Comp Sci & Engn, Shatin, Hong Kong, Peoples R China
[3] Univ Hong Kong, Dept Comp Sci & Informat Syst, Hong Kong, Hong Kong, Peoples R China
Keywords
outlier detection; scheme capability; distance-based outliers; density-based outliers; connectivity-based outliers; performance metrics
DOI
10.1007/s10115-005-0233-6
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Outlier detection is concerned with discovering exceptional behaviors of objects. Its theoretical principles and practical implementation lay a foundation for important applications such as credit card fraud detection, discovery of criminal behavior in e-commerce, and computer intrusion detection. In this paper, we first present a unified model for several existing outlier detection schemes and propose a compatibility theory, which establishes a framework for describing the capabilities of various outlier formulation schemes in terms of matching users' intuitions. Under this framework, we show that the density-based scheme is more powerful than the distance-based scheme when a dataset contains patterns with diverse characteristics. The density-based scheme, however, is less effective when the patterns have densities comparable to those of the outliers. We then introduce a connectivity-based scheme that improves on the effectiveness of the density-based scheme when a pattern itself has a density similar to that of an outlier. We compare the density-based and connectivity-based schemes in terms of their strengths and weaknesses, and demonstrate applications with different features in which each is more effective than the other. Finally, the connectivity-based and density-based schemes are comparatively evaluated on both real-life and synthetic datasets in terms of recall, precision, rank power, and implementation-free metrics.
Pages: 45-84
Page count: 40
Related papers
50 in total
  • [1] Capabilities of outlier detection schemes in large datasets, framework and methodologies
    Jian Tang
    Zhixiang Chen
    Ada Waichee Fu
    David W. Cheung
    Knowledge and Information Systems, 2007, 11 : 45 - 84
  • [2] Quality Control Framework for Large MR Datasets: Automated Approaches to Outlier Detection
    Bento, Mariana
    Souza, Roberto
    Salluzzi, Marina
    Frayne, Richard
    XXVI BRAZILIAN CONGRESS ON BIOMEDICAL ENGINEERING, CBEB 2018, VOL. 2, 2019, 70 (02) : 387 - 391
  • [3] Outlier Detection in Large Radiological Datasets Using UMAP
    Islam, Mohammad Tariqul
    Fleischer, Jason W.
    TOPOLOGY-AND GRAPH-INFORMED IMAGING INFORMATICS, TGI3 2024, 2025, 15239 : 111 - 121
  • [4] Cell-based outlier detection algorithm: A fast outlier detection algorithm for large datasets
    Wan, You
    Bian, Fuling
    ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PROCEEDINGS, 2008, 5012 : 1042 - 1048
  • [5] Local dynamic neighborhood based outlier detection approach and its framework for large-scale datasets
    Wang, Renmin
    Zhu, Qingsheng
    Luo, Jiangmei
    Zhu, Fan
    EGYPTIAN INFORMATICS JOURNAL, 2021, 22 (02) : 125 - 132
  • [6] An Improved KNN Based Outlier Detection Algorithm for Large Datasets
    Wang, Qian
    Zheng, Min
    ADVANCED DATA MINING AND APPLICATIONS, ADMA 2010, PT I, 2010, 6440 : 585 - 592
  • [7] An innovative summarization technique on large datasets for local outlier detection
    Shou, Zhaoyu
    Li, Simin
    ICIC Express Letters, 2015, 9 (11) : 2913 - 2918
  • [8] A survey of outlier detection methodologies
    Hodge V.J.
    Austin J.
    Artificial Intelligence Review, 2004, 22 (2) : 85 - 126