Stratified feature sampling method for ensemble clustering of high dimensional data

Cited by: 55

Authors:

Jing, Liping [1]

Tian, Kuang [1]

Huang, Joshua Z. [2]

Affiliations:

[1] Beijing Jiaotong Univ, Beijing Key Lab Traff Data Anal & Min, Beijing, Peoples R China

[2] Shenzhen Univ, Coll Comp Sci & Software Engn, Shenzhen, Peoples R China

Source:

PATTERN RECOGNITION | 2015, Vol. 48, No. 11

Funding:

National Natural Science Foundation of China;

Keywords:

Stratified sampling; Ensemble clustering; High dimensional data; Consensus function; CLASS DISCOVERY; CLASSIFICATION; PREDICTION; CONSENSUS; SELECTION; CANCER;

DOI:

10.1016/j.patcog.2015.05.006

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

High dimensional data with thousands of features present a big challenge to current clustering algorithms. Sparsity, noise and correlation of features are common characteristics of such data. Another common phenomenon is that clusters in such high dimensional data often exist in different subspaces. Ensemble clustering is emerging as a prominent technique for improving robustness, stability and accuracy of high dimensional data clustering. In this paper, we propose a stratified sampling method for generating subspace component data sets in ensemble clustering of high dimensional data. Instead of randomly sampling a subset of features for each component data set, in this method we first cluster the features of high dimensional data into a few feature groups called feature strata. Using stratified sampling, we randomly sample some features from each feature stratum and merge the sampled features from different feature strata to generate a component data set. In this way, the component data sets have better representations of the clustering structure in the original data set. In synthetic data analysis, component clustering with stratified sampling achieved higher average clustering accuracy than random sampling and random projection, without sacrificing clustering diversity. We carried out a series of experiments on eight real world data sets from microarray, text and image domains to evaluate ensemble clustering methods using three subspace component data generation methods and four consensus functions. The experimental results consistently showed that the stratified sampling method produced the best ensemble clustering results on all data sets. The ensemble clustering with stratified sampling also outperformed three other ensemble clustering methods which generate component clusters from the entire space of the original data. © 2015 Elsevier Ltd. All rights reserved.
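The sampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how features are clustered or how many features are drawn per stratum, so the sketch assumes a plain k-means over feature columns and an equal sampling fraction in every stratum. All function names and parameters (`cluster_features`, `stratified_feature_sample`, `make_component_datasets`, `n_strata`, `fraction`) are hypothetical.

```python
import numpy as np


def cluster_features(X, n_strata, n_iter=20, seed=0):
    """Group the columns of X into feature strata.

    Assumption: plain k-means on the feature vectors (columns of X).
    The abstract does not specify the feature clustering method.
    """
    rng = np.random.default_rng(seed)
    F = np.asarray(X, dtype=float).T  # one row per feature
    centers = F[rng.choice(F.shape[0], size=n_strata, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every feature to every centre.
        d = ((F[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for k in range(n_strata):
            members = F[labels == k]
            if len(members) > 0:  # keep the old centre if a stratum is empty
                centers[k] = members.mean(axis=0)
    return labels


def stratified_feature_sample(labels, fraction, rng):
    """Sample features from every non-empty stratum and merge the indices.

    Assumption: the same fraction is drawn from each stratum,
    with at least one feature per stratum.
    """
    chosen = []
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        n_take = max(1, int(round(fraction * idx.size)))
        chosen.append(rng.choice(idx, size=n_take, replace=False))
    return np.sort(np.concatenate(chosen))


def make_component_datasets(X, n_strata=3, fraction=0.3, n_components=5, seed=0):
    """Build several subspace component data sets from one feature stratification."""
    labels = cluster_features(X, n_strata, seed=seed)
    rng = np.random.default_rng(seed)
    return [
        X[:, stratified_feature_sample(labels, fraction, rng)]
        for _ in range(n_components)
    ]


if __name__ == "__main__":
    X = np.random.default_rng(1).normal(size=(40, 30))
    components = make_component_datasets(X, n_strata=3, fraction=0.3, n_components=4)
    for i, comp in enumerate(components):
        print(i, comp.shape)
```

Each component data set would then be clustered separately and the resulting partitions combined by a consensus function. That stage is not shown here.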

Pages: 3688-3702

Number of pages: 15

50 items in total

[41] SCEA: A Parallel Clustering Ensemble Algorithm for High-Dimensional Massive Data
Liao, Bin
Huang, Jing-Lai
Wang, Xin
Sun, Rui-Na
Ge, Xiao-Yan
Guo, Bing-Lei
Tien Tzu Hsueh Pao/Acta Electronica Sinica, 2021, 49 (06) : 1077 - 1087
[42] Feature Subset Selection for High-Dimensional, Low Sampling Size Data Classification Using Ensemble Feature Selection With a Wrapper-Based Search
Mandal, Ashis Kumar
Nadim, MD.
Saha, Hasi
Sultana, Tangina
Hossain, Md. Delowar
Huh, Eui-Nam
IEEE ACCESS, 2024, 12 : 62341 - 62357
[43] Using Feature Clustering for GP-Based Feature Construction on High-Dimensional Data
Binh Tran
Xue, Bing
Zhang, Mengjie
GENETIC PROGRAMMING, EUROGP 2017, 2017, 10196 : 210 - 226
[44] An efficient clustering method of data mining for high-dimensional data
Chang, JW
Kang, HM
8TH WORLD MULTI-CONFERENCE ON SYSTEMICS, CYBERNETICS AND INFORMATICS, VOL II, PROCEEDINGS: COMPUTING TECHNIQUES, 2004 : 273 - 278
[45] A Feature Extraction Based Ensemble Data Clustering for Healthcare Applications
Karthika, D.
Jayashri, N.
PERVASIVE COMPUTING AND SOCIAL NETWORKING, ICPCSN 2022, 2023, 475 : 1 - 7
[46] High-dimensional clustering method for high performance data mining
Chang, Jae-Woo
Lee, Hyun-Jo
COMPUTATIONAL SCIENCE - ICCS 2007, PT 3, PROCEEDINGS, 2007, 4489 : 621 - +
[47] Approximate Clustering Ensemble Method for Big Data
Mahmud, Mohammad Sultan
Huang, Joshua Zhexue
Ruby, Rukhsana
Ngueilbaye, Alladoumbaye
Wu, Kaishun
IEEE TRANSACTIONS ON BIG DATA, 2023, 9 (04) : 1142 - 1155
[48] A GA-based Feature Selection for High-dimensional Data Clustering
Sun, Mei
Xiong, Langhuan
Sun, Haojun
Jiang, Dazhi
THIRD INTERNATIONAL CONFERENCE ON GENETIC AND EVOLUTIONARY COMPUTING, 2009 : 769 - 772
[49] A stratified sampling based clustering algorithm for large-scale data
Zhao, Xingwang
Liang, Jiye
Dang, Chuangyin
KNOWLEDGE-BASED SYSTEMS, 2019, 163 : 416 - 428
[50] The shrinking-clustering method and simulation to high dimensional data
Zhang, Jian-Ye
Pan, Quan
Liang, Jian-Hai
ISTM/2007: 7TH INTERNATIONAL SYMPOSIUM ON TEST AND MEASUREMENT, VOLS 1-7, CONFERENCE PROCEEDINGS, 2007 : 2159 - 2163
