A COMPARATIVE STUDY FOR OUTLIER DETECTION METHODS IN HIGH DIMENSIONAL TEXT DATA

Cited by: 5
Author
Park, Cheong Hee [1]
Affiliation
[1] Chungnam Natl Univ, Dept Comp Sci & Engn, 220 Gung Dong, Daejeon 305763, South Korea
Keywords
Curse of dimensionality; Dimension reduction; High dimensional text data; Outlier detection; Kurtosis
DOI
10.2478/jaiscr-2023-0001
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Outlier detection aims to find data samples that differ significantly from the other samples. Various outlier detection methods have been proposed and shown to detect anomalies in many practical problems. However, in high dimensional data, conventional outlier detection methods often behave unexpectedly because of a phenomenon called the curse of dimensionality. In this paper, we compare and analyze outlier detection performance in various experimental settings, focusing on text data whose dimensionality is typically in the tens of thousands. Experiments were set up to compare the performance of outlier detection methods in unsupervised versus semi-supervised mode and under uni-modal versus multi-modal data distributions. The performance of outlier detection methods based on dimension reduction is also compared, and the use of k-NN distance in high dimensional data is discussed. Analysis through experimental comparison in these settings can provide insight into applying outlier detection methods to high dimensional data.
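The abstract refers to using k-NN distance as an outlier score. As a minimal sketch of that general idea (not the paper's implementation), the snippet below scores each sample by the Euclidean distance to its k-th nearest neighbor, in unsupervised mode. The function name `knn_distance_scores`, the choice `k=2`, and the toy two-dimensional data are assumptions made for illustration; real text data would be vectors with tens of thousands of dimensions.

```python
import math


def knn_distance_scores(samples, k):
    """Score each sample by the Euclidean distance to its k-th nearest neighbor.

    A larger score means the sample lies farther from its neighbors,
    so it is a stronger outlier candidate.
    """
    if not 1 <= k < len(samples):
        raise ValueError("k must satisfy 1 <= k < number of samples")
    scores = []
    for i, x in enumerate(samples):
        # Distances from x to every other sample, in ascending order.
        dists = sorted(math.dist(x, y) for j, y in enumerate(samples) if j != i)
        scores.append(dists[k - 1])
    return scores


# Toy data (hypothetical): four points on a unit square plus one distant point.
samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
scores = knn_distance_scores(samples, k=2)

# Index of the sample with the largest k-NN distance.
most_outlying = max(range(len(samples)), key=scores.__getitem__)
print(most_outlying, scores)
```

In this low-dimensional toy case the distant point receives by far the largest score. The abstract's concern is that in very high dimensional data such distance-based scores may behave less predictably.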
Pages: 5-17
Number of pages: 13