EnsCat: clustering of categorical data via ensembling

Cited by: 0
Authors
Clarke, Bertrand S. [1]
Amiri, Saeid [2]
Clarke, Jennifer L. [1,3]
Affiliations
[1] Univ Nebraska Lincoln, Dept Stat, Lincoln, NE 68588 USA
[2] Univ Wisconsin Madison, Dept Nat & Appl Sci, Iowa City, IA USA
[3] Univ Nebraska Lincoln, Dept Food Sci & Technol, Lincoln, NE 68588 USA
Source
BMC BIOINFORMATICS | 2016 / Volume 17
Funding
National Science Foundation (USA);
Keywords
Categorical data; Clustering; Ensembling methods; High dimensional data; ALGORITHM;
DOI
10.1186/s12859-016-1245-9
Chinese Library Classification number
Q5 [Biochemistry];
Subject classification codes
071010; 081704;
Abstract
Background: Clustering is a widely used collection of unsupervised learning techniques for identifying natural classes within a data set. It is often used in bioinformatics to infer population substructure. Genomic data are often categorical and high dimensional, e.g., long sequences of nucleotides. This makes inference challenging: distance metrics are often not well defined on categorical data; running times for computations on high dimensional data can be considerable; and the curse of dimensionality often impedes interpretation of the results. To date, the literature and software addressing clustering of categorical data have not led to a standard approach.
Results: We present software for an ensemble method that performs well in comparison with other methods regardless of the dimensionality of the data. In an ensemble method, a variety of instantiations of a statistical object are found and then combined into a consensus value. It has been known for decades that, in many settings, an ensemble generally outperforms the components that comprise it. Here, we apply this ensembling principle to clustering. We begin by generating many hierarchical clusterings with different clustering sizes. When the dimension of the data is high, we also randomly select subspaces, likewise of variable size, on which to generate clusterings. Then, we combine these clusterings into a single membership matrix and use it to obtain a new, ensembled dissimilarity matrix based on Hamming distance.
Conclusions: Ensemble clustering, as implemented in R and called EnsCat, gives more clearly separated clusters than other clustering techniques for categorical data. The latest version, with manual and examples, is available at https://github.com/jlp2duke/EnsCat.
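The procedure outlined in the abstract can be illustrated with a minimal sketch. This is not the EnsCat implementation (which is written in R); it is a pure-Python illustration under several assumptions that the abstract does not confirm: "clustering size" is read as the number of clusters, average linkage is used for every hierarchical clustering, the number of clusters cycles deterministically through a small range, and each random subspace keeps between half and all of the features. All function and parameter names (`ensemble_cluster`, `n_base`, `k_values`, and so on) are hypothetical.

```python
import random
from itertools import combinations

def hamming(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def distance_matrix(rows):
    """Symmetric matrix of pairwise Hamming distances between rows."""
    n = len(rows)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = hamming(rows[i], rows[j])
    return d

def average_linkage(d, k):
    """Agglomerative clustering (average linkage, an assumption), cut at k clusters."""
    clusters = [[i] for i in range(len(d))]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            total = sum(d[i][j] for i in clusters[a] for j in clusters[b])
            dist = total / (len(clusters[a]) * len(clusters[b]))
            if best is None or dist < best[0]:
                best = (dist, a, b)
        _, a, b = best
        clusters[a] += clusters[b]  # merge the closest pair of clusters
        del clusters[b]
    labels = [0] * len(d)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels

def ensemble_cluster(data, k_final, n_base=21, k_values=(2, 3, 4), seed=0):
    """Combine base clusterings into a membership matrix, then recluster it."""
    rng = random.Random(seed)
    n_features = len(data[0])
    membership = [[] for _ in data]  # row i: labels of point i in each base clustering
    for b in range(n_base):
        # Random subspace of variable size (half to all of the features; an assumption).
        size = rng.randint(max(1, n_features // 2), n_features)
        cols = rng.sample(range(n_features), size)
        sub = [[row[c] for c in cols] for row in data]
        k = k_values[b % len(k_values)]  # vary the number of clusters
        for i, lab in enumerate(average_linkage(distance_matrix(sub), k)):
            membership[i].append(lab)
    # Ensembled dissimilarity: Hamming distance between membership rows.
    return average_linkage(distance_matrix(membership), k_final)

# Toy data: two groups of short nucleotide sequences (illustrative only).
data = ["AAAAAAAA", "AAAAAAAC", "AAAAAACA",
        "GGGGGGGG", "GGGGGGGT", "GGGGGGTG"]
labels = ensemble_cluster(data, k_final=2)
print(labels)
```

In this sketch, two observations are close in the ensembled dissimilarity when they are assigned to the same cluster in most base clusterings, regardless of which features each base clustering used.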
Pages: 13
Related papers
50 in total
  • [41] Hierarchical division clustering framework for categorical data
    Wei, Wei
    Liang, Jiye
    Guo, Xinyao
    Song, Peng
    Sun, Yijun
    NEUROCOMPUTING, 2019, 341 : 118 - 134
  • [42] On clustering tree structured data with categorical nature
    Boutsinas, B.
    Papastergiou, T.
    PATTERN RECOGNITION, 2008, 41 (12) : 3613 - 3623
  • [43] A categorical data clustering framework on graph representation
    Bai, Liang
    Liang, Jiye
    PATTERN RECOGNITION, 2022, 128
  • [44] Coercion: A Distributed Clustering Algorithm for Categorical Data
    Wang, Bin
    Zhou, Yang
    Hei, Xinhong
    2013 9TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND SECURITY (CIS), 2013, : 683 - 687
  • [45] Rough Set Approach for Categorical Data Clustering
    Herawan, Tutut
    Yanto, Iwan Tri Riyadi
    Deris, Mustafa Mat
    DATABASE THEORY AND APPLICATION, 2009, 64 : 179 - 186
  • [46] A subspace hierarchical clustering algorithm for categorical data
    Carbonera, Joel Luis
    Abel, Mara
    2019 IEEE 31ST INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI 2019), 2019, : 509 - 516
  • [47] Clustering Categorical Data Using Hierarchies (CLUCDUH)
    Silahtaroglu, Gökhan
    World Academy of Science, Engineering and Technology, 2009, 56 : 334 - 339
  • [48] A bi-clustering framework for categorical data
    Pensa, RG
    Robardet, C
    Boulicaut, JF
    KNOWLEDGE DISCOVERY IN DATABASES: PKDD 2005, 2005, 3721 : 643 - 650
  • [49] On clustering massive text and categorical data streams
    Charu C. Aggarwal
    Philip S. Yu
    Knowledge and Information Systems, 2010, 24 : 171 - 196
  • [50] Parallel Hierarchical Subspace Clustering of Categorical Data
    Pang, Ning
    Zhang, Jifu
    Zhang, Chaowei
    Qin, Xiao
    IEEE TRANSACTIONS ON COMPUTERS, 2019, 68 (04) : 542 - 555