A COMPARATIVE STUDY FOR OUTLIER DETECTION METHODS IN HIGH DIMENSIONAL TEXT DATA

Cited by: 5
Author
Park, Cheong Hee [1]
Affiliation
[1] Chungnam Natl Univ, Dept Comp Sci & Engn, 220 Gung Dong, Daejeon 305763, South Korea
Keywords
Curse of dimensionality; Dimension reduction; High dimensional text data; Outlier detection; Kurtosis
DOI
10.2478/jaiscr-2023-0001
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Outlier detection aims to find a data sample that is significantly different from other data samples. Various outlier detection methods have been proposed and have been shown to be able to detect anomalies in many practical problems. However, in high dimensional data, conventional outlier detection methods often behave unexpectedly due to a phenomenon called the curse of dimensionality. In this paper, we compare and analyze outlier detection performance in various experimental settings, focusing on text data with dimensions typically in the tens of thousands. Experimental setups were simulated to compare the performance of outlier detection methods in unsupervised versus semi-supervised mode and uni-modal versus multi-modal data distributions. The performance of outlier detection methods based on dimension reduction is compared, and a discussion on using k-NN distance in high dimensional data is also provided. Analysis through experimental comparison in various environments can provide insights into the application of outlier detection methods in high dimensional data.
Pages: 5 - 17
Page count: 13
Related papers
50 in total
  • [1] Outlier Detection in Data Streams - A Comparative Study of Selected Methods
    Duraj, Agnieszka
    Szczepaniak, Piotr S.
    KNOWLEDGE-BASED AND INTELLIGENT INFORMATION & ENGINEERING SYSTEMS (KSE 2021), 2021, 192 : 2769 - 2778
  • [2] Outlier Detection in High Dimensional Data
    Kamalov, Firuz
    Leung, Ho Hon
    JOURNAL OF INFORMATION & KNOWLEDGE MANAGEMENT, 2020, 19 (01)
  • [3] Outlier detection for high dimensional data
    Aggarwal, CC
    Yu, PS
    SIGMOD RECORD, 2001, 30 (02) : 37 - 46
  • [4] A survey on unsupervised subspace outlier detection methods for high dimensional data
    Ahn, Jaehyeong
    Kwon, Sunghoon
    KOREAN JOURNAL OF APPLIED STATISTICS, 2021, 34 (03) : 507 - 521
  • [5] Outlier detection for high-dimensional data
    Ro, Kwangil
    Zou, Changliang
    Wang, Zhaojun
    Yin, Guosheng
    BIOMETRIKA, 2015, 102 (03) : 589 - 599
  • [6] Intrinsic dimensional outlier detection in high-dimensional data
    Von Brünken, Jonathan
    Houle, Michael E.
    Zimek, Arthur
    NII Technical Reports, 2015, (03) : 1 - 12
  • [7] Research on outlier detection for high dimensional data stream
    Yu, Liping
    Li, Yunfei
    Jia, Juncheng
    PROCEEDINGS OF THE 2016 INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND ENGINEERING APPLICATIONS, 2016, 63 : 395 - 398
  • [8] Efficient Outlier Detection for High-Dimensional Data
    Liu, Huawen
    Li, Xuelong
    Li, Jiuyong
    Zhang, Shichao
    IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS, 2018, 48 (12) : 2451 - 2461
  • [9] Outlier detection in relevant subspace of high dimensional data
    Chen, Zijun
    Zhang, Liang
    Sun, Dejie
    Liu, Wenyuan
    ICIC Express Letters, 2011, 5 (06) : 2023 - 2028
  • [10] A survey of outlier detection in high dimensional data streams
    Souiden, Imen
    Omri, Mohamed Nazih
    Brahmi, Zaki
    COMPUTER SCIENCE REVIEW, 2022, 44