Stratified feature sampling method for ensemble clustering of high dimensional data

Cited by: 55
Authors
Jing, Liping [1]
Tian, Kuang [1]
Huang, Joshua Z. [2]
Affiliations
[1] Beijing Jiaotong Univ, Beijing Key Lab Traff Data Anal & Min, Beijing, Peoples R China
[2] Shenzhen Univ, Coll Comp Sci & Software Engn, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Stratified sampling; Ensemble clustering; High dimensional data; Consensus function; CLASS DISCOVERY; CLASSIFICATION; PREDICTION; CONSENSUS; SELECTION; CANCER;
DOI
10.1016/j.patcog.2015.05.006
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
High dimensional data with thousands of features present a major challenge to current clustering algorithms. Sparsity, noise and correlation of features are common characteristics of such data. Another common phenomenon is that clusters in such high dimensional data often exist in different subspaces. Ensemble clustering is emerging as a prominent technique for improving the robustness, stability and accuracy of high dimensional data clustering. In this paper, we propose a stratified sampling method for generating subspace component data sets in ensemble clustering of high dimensional data. Instead of randomly sampling a subset of features for each component data set, this method first clusters the features of the high dimensional data into a few feature groups called feature strata. Using stratified sampling, we then randomly sample some features from each feature stratum and merge the sampled features from the different strata to generate a component data set. In this way, the component data sets better represent the clustering structure of the original data set. Compared with random sampling and random projection methods in synthetic data analysis, component clustering by stratified sampling increased the average clustering accuracy without sacrificing clustering diversity. We carried out a series of experiments on eight real world data sets from the microarray, text and image domains to evaluate ensemble clustering methods using three subspace component data generation methods and four consensus functions. The experimental results consistently showed that the stratified sampling method produced the best ensemble clustering results on all data sets. Ensemble clustering with stratified sampling also outperformed three other ensemble clustering methods that generate component clusters from the entire space of the original data. (C) 2015 Elsevier Ltd. All rights reserved.
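The sampling scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how features are grouped or how many features are drawn per stratum, so the plain k-means grouping of feature columns, the `fraction` parameter, and all function names below are assumptions made for illustration only.

```python
import random


def group_features(columns, n_strata, n_iter=20, seed=0):
    """Assign each feature (column) to a stratum.

    Assumption: a plain k-means over feature columns stands in for the
    feature-grouping step, which the abstract does not specify.
    """
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(columns, n_strata)]
    labels = [0] * len(columns)
    for _ in range(n_iter):
        # Assign every feature to its nearest center (squared Euclidean).
        for j, col in enumerate(columns):
            labels[j] = min(
                range(n_strata),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(col, centers[k])),
            )
        # Move each center to the mean of the features assigned to it.
        for k in range(n_strata):
            members = [columns[j] for j in range(len(columns)) if labels[j] == k]
            if members:
                centers[k] = [sum(v) / len(members) for v in zip(*members)]
    return labels


def sample_from_strata(labels, fraction, rng):
    """Draw features from every non-empty stratum and merge the indices."""
    chosen = []
    for k in sorted(set(labels)):
        stratum = [j for j, lab in enumerate(labels) if lab == k]
        # Hypothetical rule: take a fixed fraction, but at least one feature.
        size = min(len(stratum), max(1, round(fraction * len(stratum))))
        chosen.extend(rng.sample(stratum, size))
    return sorted(chosen)


def component_data_sets(data, n_strata=3, fraction=0.5, n_components=5, seed=0):
    """Build subspace component data sets from a list of rows (objects)."""
    columns = [list(col) for col in zip(*data)]
    labels = group_features(columns, n_strata, seed=seed)
    rng = random.Random(seed)
    components = []
    for _ in range(n_components):
        idx = sample_from_strata(labels, fraction, rng)
        # Keep only the sampled features for every object.
        components.append((idx, [[row[j] for j in idx] for row in data]))
    return labels, components
```

Each component data set would then be clustered separately and the resulting partitions combined by a consensus function; that step is outside the scope of this sketch.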
Pages: 3688 - 3702
Number of pages: 15
Related papers
50 in total
  • [31] Ensemble clustering and feature weighting in time series data
    Ainaz Bahramlou
    Massoud Reza Hashemi
    Zeinab Zali
    The Journal of Supercomputing, 2023, 79 : 16442 - 16478
  • [32] Ensemble clustering and feature weighting in time series data
    Bahramlou, Ainaz
    Hashemi, Massoud Reza
    Zali, Zeinab
    JOURNAL OF SUPERCOMPUTING, 2023, 79 (15): : 16442 - 16478
  • [33] FEATURE CLUSTERING FOR PSO-BASED FEATURE CONSTRUCTION ON HIGH-DIMENSIONAL DATA
    Swesi, Idheba Mohamad Ali Omer
    Abu Bakar, Azuraliza
    JOURNAL OF INFORMATION AND COMMUNICATION TECHNOLOGY-MALAYSIA, 2019, 18 (04): : 439 - 472
  • [34] RETRACTED: An Ensemble Clustering Approach (Consensus Clustering) for High-Dimensional Data (Retracted Article)
    Yan, Jingdong
    Liu, Wuwei
    SECURITY AND COMMUNICATION NETWORKS, 2022, 2022
  • [35] Clustering High-Dimensional Data via Random Sampling and Consensus
    Traganitis, Panagiotis A.
    Slavakis, Konstantinos
    Giannakis, Georgios B.
    2014 IEEE GLOBAL CONFERENCE ON SIGNAL AND INFORMATION PROCESSING (GLOBALSIP), 2014, : 307 - 311
  • [36] CSS: Handling imbalanced data by improved clustering with stratified sampling
    Cao, Lu
    Shen, Hong
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2022, 34 (02):
  • [37] A Clustering Algorithm for High-Dimensional Nonlinear Feature Data with Applications
    Jiang H.
    Wang G.
    Gao J.
    Gao Z.
    Gao R.
    Guo Q.
    Hsi-An Chiao Tung Ta Hsueh/Journal of Xi'an Jiaotong University, 2017, 51 (12): : 49 - 55 and 90
  • [38] On online high-dimensional spherical data clustering and feature selection
    Amayri, Ola
    Bouguila, Nizar
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2013, 26 (04) : 1386 - 1398
  • [39] Latent Feature Group Learning for High-Dimensional Data Clustering
    Wang, Wenting
    He, Yulin
    Ma, Liheng
    Huang, Joshua Zhexue
    INFORMATION, 2019, 10 (06)
  • [40] An Initialization Method for Clustering High-Dimensional Data
    Chen, Luying
    Chen, Lifei
    Jiang, Qingshan
    Wang, Beizhan
    Shi, Liang
    FIRST INTERNATIONAL WORKSHOP ON DATABASE TECHNOLOGY AND APPLICATIONS, PROCEEDINGS, 2009, : 444 - +