Selections of data preprocessing methods and similarity metrics for gene cluster analysis

Author: YANG Chunmei
Affiliation: Motorola (China) Electronics Ltd.
Keywords: gene expression; cluster analysis; data preprocessing; similarity metrics; Rand index
DOI: not available
CLC number: Q75-33
Subject classification code:
Abstract
Clustering is one of the major exploratory techniques for gene expression data analysis. High-quality clusters can be obtained only when the similarity metric is suitable and the datasets are properly preprocessed. In this study, gene expression datasets with external evaluation criteria were preprocessed by row-wise normalization, column-wise normalization, or base-2 logarithm transformation, and were then clustered by hierarchical clustering, k-means clustering, and self-organizing maps (SOMs), with either the Pearson correlation coefficient or Euclidean distance as the similarity metric. The quality of the resulting clusters was evaluated with the adjusted Rand index. The results show that k-means clustering and SOMs have distinct advantages over hierarchical clustering for gene clustering, and that SOMs perform slightly better than k-means under random initialization. They also show that hierarchical clustering works best with the Pearson correlation coefficient as the similarity metric and with row-normalized datasets, whereas k-means clustering and SOMs produce better clusters with Euclidean distance and logarithm-transformed datasets. These results provide a useful reference for implementing gene expression cluster analysis.
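The building blocks named in the abstract can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the study's implementation: the abstract does not give the normalization formula, so per-gene normalization is assumed here to be a z-score of each profile, and all function names are hypothetical. The adjusted Rand index follows the usual Hubert and Arabie contingency-table form.

```python
import math
from collections import Counter
from statistics import fmean, pstdev


def log2_transform(profile):
    """Base-2 logarithm of each expression value (values must be > 0)."""
    return [math.log2(v) for v in profile]


def normalize_profile(profile):
    """ASSUMPTION: z-score one gene's profile (zero mean, unit population variance).

    The profile must not be constant, otherwise the standard deviation is 0.
    """
    mu, sigma = fmean(profile), pstdev(profile)
    return [(v - mu) / sigma for v in profile]


def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.dist(x, y)


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same n >= 2 items."""
    n = len(labels_true)
    # Contingency-table cells, row sums and column sums, each as pair counts.
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(math.comb(c, 2) for c in cells.values())
    sum_rows = sum(math.comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(math.comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / math.comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case: both partitions are all-in-one or all singletons.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    # Made-up profiles for two genes and made-up cluster labels.
    gene_a = log2_transform([1.0, 2.0, 4.0, 8.0])
    gene_b = log2_transform([3.0, 5.0, 20.0, 30.0])
    print("Pearson r:", pearson(gene_a, gene_b))
    print("Euclidean:", euclidean(gene_a, gene_b))
    print("ARI:", adjusted_rand_index([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2]))
```

One consequence worth noting: once profiles are z-scored as above, the squared Euclidean distance between two profiles of length n equals 2n(1 - r), where r is their Pearson correlation, so the two similarity metrics rank gene pairs identically on data normalized this way. On raw or log-transformed data they can disagree, which is where the choice of metric starts to matter.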
Pages: 607-613
Number of pages: 7