Model-based clustering and outlier detection with missing data

Authors
Hung Tong
Cristina Tortora
Affiliation
[1] San José State University
Keywords
Model-based clustering; Data missing at random; Contaminated normal distribution; Outliers; 62H30
DOI
Not available
Abstract
The multivariate contaminated normal (MCN) distribution is recommended in model-based clustering for data characterized by mild outliers: the model can simultaneously detect outliers automatically and produce robust parameter estimates in each cluster. However, one limitation of this approach is that it requires complete data, i.e., the MCN cannot be used directly on data with missing values. In this paper, we develop a framework for fitting a mixture of MCN distributions to incomplete data sets, i.e., data sets with some values missing at random. Parameters are estimated using the expectation-conditional maximization algorithm, a variant of the expectation-maximization algorithm in which the traditional maximization steps are replaced by simpler conditional maximization steps. We perform a simulation study to compare the results of our model to mixtures of multivariate normal and Student’s t distributions for incomplete data. The simulation also includes a study of the effect of the percentage of missing data on the performance of the three algorithms. The model is then applied to the Automobile data set (UCI machine learning repository). The results show that, while the Student’s t distribution gives similar classification performance, the MCN is better at detecting outliers, with a lower false positive rate of outlier detection. The performance of all the techniques decreases linearly as the percentage of missing values increases.
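To make the idea of automatic outlier detection with missing values concrete, the following is a minimal sketch for a single cluster with known parameters. It is not the authors' implementation and it does not perform any estimation. It assumes the standard two-component form of the contaminated normal density, alpha * N(mu, Sigma) + (1 - alpha) * N(mu, eta * Sigma) with eta > 1, and handles a partially observed point by evaluating the density on its observed coordinates only, which is valid because the marginals of a normal distribution are normal. The function names (`mvn_logpdf`, `mcn_observed`), the parameter values, and the 0.5 flagging threshold are illustrative assumptions.

```python
import numpy as np

def mvn_logpdf(x, mu, sigma):
    """Log-density of a multivariate normal N(mu, sigma) at the point x."""
    d = x.shape[0]
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    maha = diff @ np.linalg.solve(sigma, diff)  # squared Mahalanobis distance
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def mcn_observed(x, mu, sigma, alpha, eta):
    """Contaminated normal density of x on its observed (non-NaN) coordinates.

    Returns (density, prob_good), where prob_good is the posterior
    probability that x comes from the non-inflated ("good") component.
    Assumes at least one coordinate of x is observed.
    """
    obs = ~np.isnan(x)
    x_o, mu_o = x[obs], mu[obs]
    sigma_oo = sigma[np.ix_(obs, obs)]
    good = alpha * np.exp(mvn_logpdf(x_o, mu_o, sigma_oo))
    bad = (1 - alpha) * np.exp(mvn_logpdf(x_o, mu_o, eta * sigma_oo))
    density = good + bad
    return density, good / density

# Hypothetical parameters for one cluster (assumed for illustration only).
mu = np.zeros(3)
sigma = np.eye(3)
alpha, eta = 0.9, 10.0

# Two points, each with the second coordinate missing.
x_typical = np.array([0.1, np.nan, -0.2])
x_far = np.array([6.0, np.nan, 5.0])

_, p_typical = mcn_observed(x_typical, mu, sigma, alpha, eta)
_, p_far = mcn_observed(x_far, mu, sigma, alpha, eta)

# Flag a point as an outlier when prob_good falls below 0.5 (assumed rule).
is_outlier = [p < 0.5 for p in (p_typical, p_far)]
```

In a full mixture fitted by an expectation-conditional maximization algorithm, probabilities of this kind would be computed per cluster inside the E-step with the current parameter estimates; here the parameters are simply fixed by hand.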
Pages: 5-30
Number of pages: 25