Model-based clustering and outlier detection with missing data

Authors
Hung Tong
Cristina Tortora
Affiliations
[1] San José State University
Keywords
Model-based clustering; Data missing at random; Contaminated normal distribution; Outliers; 62H30
DOI
Not available
Abstract
The multivariate contaminated normal (MCN) distribution is recommended in model-based clustering for data characterized by mild outliers: the model can simultaneously detect outliers automatically and produce robust parameter estimates in each cluster. However, one limitation of this approach is that it requires complete data, i.e. the MCN cannot be used directly on data with missing values. In this paper, we develop a framework for fitting a mixture of MCN distributions to incomplete data sets, i.e. data sets with some values missing at random. Parameters are estimated using the expectation-conditional maximization algorithm, a variant of the expectation-maximization algorithm in which the traditional maximization step is replaced by simpler conditional maximization steps. We perform a simulation study to compare the results of our model with those of mixtures of multivariate normal and Student’s t distributions for incomplete data. The simulation also includes a study of the effect of the percentage of missing data on the performance of the three algorithms. The model is then applied to the Automobile data set (UCI Machine Learning Repository). The results show that, while the Student’s t distribution gives similar classification performance, the MCN detects outliers better, with a lower false positive rate. The performance of all the techniques decreases linearly as the percentage of missing values increases.
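The abstract does not state the model's formulas, so the following is only a rough sketch of the idea for a single cluster, not the authors' implementation. It assumes the commonly used contaminated normal parameterization, in which a proportion alpha of points follows a normal distribution with covariance Sigma and the remainder follows a normal distribution with the same mean and inflated covariance eta * Sigma (eta > 1). It also assumes that missing entries are coded as NaN and handled by evaluating the density on the observed coordinates only, which is valid because the marginals of a normal distribution are normal. The function names, the parameter names alpha and eta, and the 0.5 cutoff are illustrative choices; the expectation-conditional maximization fitting itself is not shown.

```python
import numpy as np


def normal_pdf(x, mean, cov):
    """Multivariate normal density at a single point x (1-D array)."""
    d = x.shape[0]
    diff = x - mean
    solved = np.linalg.solve(cov, diff)        # cov^{-1} (x - mean)
    _, logdet = np.linalg.slogdet(cov)         # log|cov|, numerically stable
    log_pdf = -0.5 * (d * np.log(2.0 * np.pi) + logdet + diff @ solved)
    return np.exp(log_pdf)


def mcn_observed(x, mean, cov, alpha, eta):
    """Contaminated normal density on the observed coordinates of x.

    NaN entries of x are treated as missing (at least one entry must be
    observed). Returns the density and the posterior probability that x
    is a "good" (non-outlying) point.
    All parameter names are assumptions made for this sketch.
    """
    obs = ~np.isnan(x)                         # mask of observed coordinates
    x_o = x[obs]
    mean_o = mean[obs]
    cov_oo = cov[np.ix_(obs, obs)]             # observed-by-observed block

    good = alpha * normal_pdf(x_o, mean_o, cov_oo)
    bad = (1.0 - alpha) * normal_pdf(x_o, mean_o, eta * cov_oo)
    density = good + bad
    return density, good / density


# Illustrative parameters (made up, not taken from the paper).
mean = np.zeros(3)
cov = np.eye(3)
points = {
    "typical": np.array([0.2, np.nan, -0.1]),
    "extreme": np.array([6.0, np.nan, -5.0]),
}
for name, x in points.items():
    density, p_good = mcn_observed(x, mean, cov, alpha=0.9, eta=10.0)
    label = "outlier" if p_good < 0.5 else "not an outlier"
    print(f"{name}: P(good) = {p_good:.3g} -> {label}")
```

In a full mixture model, a quantity of this kind would be computed per cluster and combined with the cluster membership probabilities; here a single component is used to keep the example short.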
Pages: 5-30
Page count: 25