Selections of data preprocessing methods and similarity metrics for gene cluster analysis

Cited by: 0
Authors
YANG Chunmei
Institution
Motorola (China) Electronics Ltd.
Keywords
gene expression; cluster analysis; data preprocessing; similarity metrics; Rand index
DOI
Not available
Chinese Library Classification number
Q75-33
Subject classification code
Abstract
Clustering is one of the major exploratory techniques for gene expression data analysis. High-quality clustering results can be obtained only when a suitable similarity metric is used and the datasets are properly preprocessed. In this study, gene expression datasets with external evaluation criteria were preprocessed by row normalization, column normalization, or base-2 logarithm transformation, and were then clustered by hierarchical clustering, k-means clustering, and self-organizing maps (SOMs), with either the Pearson correlation coefficient or Euclidean distance as the similarity metric. Finally, the quality of the clusters was evaluated with the adjusted Rand index. The results show that k-means clustering and SOMs have distinct advantages over hierarchical clustering for gene clustering, and that SOMs perform slightly better than k-means when randomly initialized. They also show that hierarchical clustering works best with the Pearson correlation coefficient as the similarity metric and with row-normalized datasets, whereas k-means clustering and SOMs produce better clusters with Euclidean distance and logarithm-transformed datasets. These results provide a useful reference for implementing gene expression cluster analysis.
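The building blocks named in the abstract (three preprocessing options, two similarity metrics, and the adjusted Rand index) can be sketched in plain Python. This is an illustrative sketch, not the paper's implementation: the function names are hypothetical, rows are assumed to be genes and columns samples, and "normalization" is assumed to mean z-scoring (mean 0, standard deviation 1), which the abstract does not specify. The clustering algorithms themselves are omitted.

```python
import math
from collections import Counter


def log2_transform(matrix):
    """Base-2 logarithm of every value (values must be strictly positive)."""
    return [[math.log2(v) for v in row] for row in matrix]


def normalize_rows(matrix):
    """Z-score each row (assumed meaning of 'normalization by line')."""
    out = []
    for row in matrix:
        mean = sum(row) / len(row)
        sd = math.sqrt(sum((v - mean) ** 2 for v in row) / len(row))
        # A constant row has sd == 0; map it to zeros instead of dividing by 0.
        out.append([(v - mean) / sd if sd else 0.0 for v in row])
    return out


def normalize_columns(matrix):
    """Z-score each column: transpose, normalize rows, transpose back."""
    transposed = [list(col) for col in zip(*matrix)]
    return [list(row) for row in zip(*normalize_rows(transposed))]


def euclidean_distance(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """1 - Pearson correlation, so perfectly correlated profiles give 0."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 1.0  # correlation undefined for a constant profile (assumption)
    return 1.0 - cov / (sx * sy)


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between a reference and a predicted partition."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # fewer than two items: partitions trivially agree
    # Count pairs of items that share a cluster in both, in the first,
    # and in the second partition respectively.
    sum_cells = sum(math.comb(c, 2)
                    for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(math.comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(math.comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / math.comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are one cluster
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    genes = [[1, 2, 4], [2, 4, 8], [8, 4, 2]]  # toy data, not from the paper
    print(pearson_distance(genes[0], genes[1]))
    print(euclidean_distance(*log2_transform(genes[:2])))
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

The adjusted Rand index ignores label names, so a clustering that recovers the reference groups under different labels still scores 1, while a clustering no better than chance scores about 0.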
Pages: 607-613
Page count: 7
Related papers
50 in total
  • [41] Cluster analysis of protein array results via similarity of Gene Ontology annotation
    Wolting, Cheryl
    McGlade, C. Jane
    Tritchler, David
    BMC BIOINFORMATICS, 2006, 7 (1)
  • [43] STRUCTURAL SIMILARITY METRICS FOR TEXTURE ANALYSIS AND RETRIEVAL
    Zujovic, Jana
    Pappas, Thrasyvoulos N.
    Neuhoff, David L.
    2009 16TH IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, VOLS 1-6, 2009, : 2225 - +
  • [44] Cluster analysis and its applications to gene expression data
    Sharan, R
    Elkon, R
    Shamir, R
    BIOINFORMATICS AND GENOME ANALYSIS, 2002, 38 : 83 - 108
  • [45] Cluster analysis of large scale gene expression data
    Erb, RS
    Michaels, GS
    DIMENSION REDUCTION, COMPUTATIONAL COMPLEXITY AND INFORMATION, 1998, 30 : 303 - 308
  • [46] Implementation χ-Sim Co-Similarity and Agglomerative Hierarchical to Cluster Gene Expression Data of Lymphoma by Gene and Condition
    Bustamam, A.
    Zubedi, F.
    Siswantining, T.
    PROCEEDINGS OF THE 3RD INTERNATIONAL SYMPOSIUM ON CURRENT PROGRESS IN MATHEMATICS AND SCIENCES 2017 (ISCPMS2017), 2018, 2023
  • [47] Methods of Gene Ontology Term Similarity Analysis in Graph Database Environment
    Stypka, Lukasz
    Kozielski, Michal
    BEYOND DATABASES, ARCHITECTURES AND STRUCTURES, BDAS 2014, 2014, 424 : 345 - 354
  • [48] Application of cluster analysis of temporal gene expression data to panel data
    Nascimento, Moyses
    Safadi, Thelma
    Fonseca e Silva, Fabyano
    PESQUISA AGROPECUARIA BRASILEIRA, 2011, 46 (11) : 1489 - 1495
  • [49] A Survey of Preprocessing Methods Used for Analysis of Big Data Originated From Smart Grids
    Alghamdi, Turki Ali
    Javaid, Nadeem
    IEEE ACCESS, 2022, 10 : 29149 - 29171
  • [50] Fuzzy cluster analysis: Methods for classification, data analysis and image recognition
    Rayward-Smith, VJ
    JOURNAL OF THE OPERATIONAL RESEARCH SOCIETY, 2000, 51 (06) : 769 - 770