A novel minorization-maximization framework for simultaneous feature selection and clustering of high-dimensional count data

Times cited: 1
Authors
Zamzami, Nuha [1]
Bouguila, Nizar [2]
Affiliations
[1] Univ Jeddah, Coll Comp Sci & Engn, Dept Comp Sci & Artificial Intelligence, Jeddah, Saudi Arabia
[2] Concordia Univ, Concordia Inst Informat Syst Engn CIISE, Montreal, PQ, Canada
Keywords
Feature saliency; Feature selection; Model selection; Unsupervised learning; Count data; Mixture models; Generalized Dirichlet multinomial; Maximum likelihood; Minorization-maximization; UNSUPERVISED FEATURE-SELECTION; DISCRIMINANT-ANALYSIS; MAXIMUM-LIKELIHOOD; MODEL SELECTION; ALGORITHM; CLASSIFICATION; MIXTURES;
DOI
10.1007/s10044-022-01094-z
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Count data are commonly exploited in machine learning and computer vision applications; however, they often suffer from the well-known curse of dimensionality, which dramatically degrades the performance of clustering algorithms. Feature selection is a major technique for handling a large number of features, most of which are often redundant or noisy. In this paper, we propose a probabilistic approach for count data based on the concept of feature saliency in the context of mixture-based clustering using the generalized Dirichlet multinomial distribution. The saliency of irrelevant features is driven toward zero by minimizing the message length, which amounts to performing feature selection and model selection simultaneously. Experiments on a range of challenging applications, including text and image clustering, show that the developed approach is effective in identifying both the optimal number of clusters and the most important features, thereby significantly enhancing clustering performance.
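The feature saliency idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes the generic formulation from the feature saliency literature, in which each feature is drawn from a cluster-specific density with probability equal to its saliency and from a shared background density otherwise. The function name `saliency_mixture_loglik` and all argument names are hypothetical, and the per-feature probabilities are passed in as plain arrays; how the paper derives them from the generalized Dirichlet multinomial distribution is not shown here.

```python
import numpy as np

def saliency_mixture_loglik(p_rel, p_bg, weights, saliency):
    """Log-likelihood of a mixture with per-feature saliencies (illustrative only).

    p_rel    : (N, M, D) probability of feature d of sample n under cluster m
    p_bg     : (N, D)    probability of feature d of sample n under a shared
                         background density (same for every cluster)
    weights  : (M,)      mixing weights, assumed to sum to 1
    saliency : (D,)      probability that each feature is relevant, in [0, 1]
    """
    p_rel = np.asarray(p_rel, dtype=float)
    p_bg = np.asarray(p_bg, dtype=float)
    weights = np.asarray(weights, dtype=float)
    rho = np.asarray(saliency, dtype=float)

    # Each feature mixes the cluster-specific and the background density.
    per_feature = rho * p_rel + (1.0 - rho) * p_bg[:, None, :]

    # Sum log-probabilities over features, then add the log mixing weights.
    log_comp = np.log(per_feature).sum(axis=2) + np.log(weights)  # (N, M)

    # Log-sum-exp over clusters, shifted by the row maximum for stability.
    peak = log_comp.max(axis=1, keepdims=True)
    log_mix = peak[:, 0] + np.log(np.exp(log_comp - peak).sum(axis=1))
    return float(log_mix.sum())

# Synthetic probabilities, used only to exercise the function.
rng = np.random.default_rng(0)
p_rel = rng.uniform(0.1, 0.9, size=(5, 2, 3))
p_bg = rng.uniform(0.1, 0.9, size=(5, 3))
weights = np.array([0.4, 0.6])

ll_all_relevant = saliency_mixture_loglik(p_rel, p_bg, weights, np.ones(3))
ll_none_relevant = saliency_mixture_loglik(p_rel, p_bg, weights, np.zeros(3))
```

With every saliency set to 1 the expression reduces to an ordinary mixture likelihood, and with every saliency set to 0 the cluster structure drops out entirely, which is why a feature whose saliency shrinks to zero no longer influences the clustering.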
Pages: 91-106
Number of pages: 16