Stratified feature sampling method for ensemble clustering of high dimensional data

Cited by: 55
Authors
Jing, Liping [1 ]
Tian, Kuang [1 ]
Huang, Joshua Z. [2 ]
Affiliations
[1] Beijing Jiaotong Univ, Beijing Key Lab Traff Data Anal & Min, Beijing, Peoples R China
[2] Shenzhen Univ, Coll Comp Sci & Software Engn, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Stratified sampling; Ensemble clustering; High dimensional data; Consensus function; CLASS DISCOVERY; CLASSIFICATION; PREDICTION; CONSENSUS; SELECTION; CANCER;
DOI
10.1016/j.patcog.2015.05.006
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
High dimensional data with thousands of features present a big challenge to current clustering algorithms. Sparsity, noise and correlation of features are common characteristics of such data. Another common phenomenon is that clusters in such high dimensional data often exist in different subspaces. Ensemble clustering is emerging as a prominent technique for improving the robustness, stability and accuracy of high dimensional data clustering. In this paper, we propose a stratified sampling method for generating subspace component data sets in ensemble clustering of high dimensional data. Instead of randomly sampling a subset of features for each component data set, we first cluster the features of the high dimensional data into a few feature groups called feature strata. Using stratified sampling, we then randomly sample some features from each feature stratum and merge the sampled features from the different strata to generate a component data set. In this way, the component data sets better represent the clustering structure of the original data set. In synthetic data analysis, component clustering with stratified sampling achieved higher average clustering accuracy than random sampling and random projection methods without sacrificing clustering diversity. We carried out a series of experiments on eight real world data sets from the microarray, text and image domains to evaluate ensemble clustering methods using three subspace component data generation methods and four consensus functions. The experimental results consistently showed that the stratified sampling method produced the best ensemble clustering results on all data sets. Ensemble clustering with stratified sampling also outperformed three other ensemble clustering methods that generate component clusters from the entire space of the original data. (C) 2015 Elsevier Ltd. All rights reserved.
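The sampling step described in the abstract (cluster the features into strata, sample from each stratum, merge) can be illustrated with a minimal sketch. The abstract does not say how the features are clustered or how many features are drawn per stratum, so this sketch assumes plain k-means on the feature columns and a fixed sampling fraction. The names `kmeans`, `stratified_feature_sample`, `n_strata` and `frac` are hypothetical and are not taken from the paper.

```python
import numpy as np


def kmeans(points, k, rng, n_iter=20):
    """Plain Lloyd's k-means; returns one cluster label per row of `points`."""
    # Start from k distinct rows chosen at random (fancy indexing copies them).
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # leave an empty cluster's center unchanged
                centers[j] = members.mean(axis=0)
    return labels


def stratified_feature_sample(X, n_strata, frac, rng):
    """Return sorted column indices of X, sampled stratum by stratum."""
    # Assumption: feature strata come from k-means on the columns of X,
    # so each feature (column) is treated as one point.
    strata = kmeans(X.T.astype(float), n_strata, rng)
    chosen = []
    for s in np.unique(strata):
        members = np.flatnonzero(strata == s)
        # Assumption: draw the same fraction from every stratum, at least one.
        n_pick = max(1, int(round(frac * len(members))))
        chosen.extend(rng.choice(members, size=n_pick, replace=False))
    return np.sort(np.array(chosen))


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 40))  # toy data: 30 objects, 40 features

# Each component data set keeps all objects but only the sampled features.
components = [
    X[:, stratified_feature_sample(X, n_strata=4, frac=0.25, rng=rng)]
    for _ in range(5)
]
```

Each entry of `components` would then be clustered on its own and the resulting partitions combined by a consensus function. That step is not sketched here because the abstract does not name the four consensus functions used.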
Pages: 3688-3702
Number of pages: 15