Selections of data preprocessing methods and similarity metrics for gene cluster analysis

Cited by: 0

Author:

YANG Chunmei

Affiliation:

Motorola (China) Electronics Ltd.

Source:

Progress in Natural Science | 2006, Issue 06

Keywords:

gene expression; cluster analysis; data preprocessing; similarity metrics; Rand index

DOI:

Not available

Chinese Library Classification number:

Q75-33

Subject classification number:

Abstract:

Clustering is one of the major exploratory techniques for gene expression data analysis. High-quality clusters can be obtained only when the datasets are properly preprocessed and a suitable similarity metric is used. In this study, gene expression datasets with external evaluation criteria were preprocessed by normalization by line (row), normalization by column, or base-2 logarithm transformation, and were then clustered by hierarchical clustering, k-means clustering, and self-organizing maps (SOMs), with either the Pearson correlation coefficient or Euclidean distance as the similarity metric. Cluster quality was then evaluated with the adjusted Rand index. The results show that k-means clustering and SOMs have distinct advantages over hierarchical clustering for gene clustering, and that SOMs perform slightly better than k-means when randomly initialized. They also show that hierarchical clustering works best with the Pearson correlation coefficient as the similarity metric and with datasets normalized by line, whereas k-means clustering and SOMs produce better clusters with Euclidean distance and logarithm-transformed datasets. These results provide a useful reference for implementing gene expression cluster analysis.
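The building blocks named in the abstract can be sketched in a few lines of standard-library Python. This is an illustration only, not the author's implementation: the abstract does not define "normalization", so z-score standardization (mean 0, standard deviation 1) is assumed here, "by line" is taken to mean per row (one gene per row), and all function names are hypothetical. The adjusted Rand index follows the usual pair-counting definition and expects at least two items.

```python
import math
from collections import Counter

def zscore(values):
    """Standardize values to mean 0 and population std 1 (assumed normalization)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return [0.0 for _ in values]  # constant profile: no variation to scale
    return [(v - mean) / std for v in values]

def normalize_by_row(matrix):
    """Standardize each row (assumed: one gene per row)."""
    return [zscore(row) for row in matrix]

def normalize_by_column(matrix):
    """Standardize each column (assumed: one sample or condition per column)."""
    cols = [zscore(list(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*cols)]

def log2_transform(matrix):
    """Base-2 logarithm of every entry; entries must be strictly positive."""
    return [[math.log2(v) for v in row] for row in matrix]

def pearson(x, y):
    """Pearson correlation coefficient between two non-constant profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def euclidean(x, y):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between a reference partition and a clustering."""
    total_pairs = math.comb(len(labels_true), 2)
    # Pairs of items placed together in both partitions (contingency cells).
    together_both = sum(math.comb(c, 2)
                        for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(math.comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(math.comb(c, 2) for c in Counter(labels_pred).values())
    expected = together_true * together_pred / total_pairs
    max_index = (together_true + together_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (together_both - expected) / (max_index - expected)

if __name__ == "__main__":
    # Toy expression matrix (made-up values): 3 genes x 3 conditions.
    expr = [[1.0, 2.0, 4.0], [2.0, 4.0, 8.0], [8.0, 4.0, 2.0]]
    logged = log2_transform(expr)
    print(pearson(logged[0], logged[1]), euclidean(logged[0], logged[1]))
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

In the toy matrix, the first two genes share the same trend at different magnitudes, so the Pearson correlation treats them as identical while the Euclidean distance keeps them apart. That difference is why the choice of metric and the choice of preprocessing interact. The adjusted Rand index ignores label names, so a clustering that matches the reference partition up to relabeling scores 1.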

Citation

Pages: 607-613

Page count: 7
