Selections of data preprocessing methods and similarity metrics for gene cluster analysis

Author
YANG Chunmei
Affiliation
Motorola (China) Electronics Ltd.
Keywords
gene expression; cluster analysis; data preprocessing; similarity metrics; Rand index
DOI
Not available
Chinese Library Classification number
Q75-33
Abstract
Clustering is one of the major exploratory techniques for gene expression data analysis. High-quality clusters can be obtained only when a suitable similarity metric is used and the dataset is properly preprocessed. In this study, gene expression datasets with external evaluation criteria were preprocessed by row normalization, column normalization, or base-2 logarithm transformation, and were then clustered by hierarchical clustering, k-means clustering, and self-organizing maps (SOMs), with either the Pearson correlation coefficient or Euclidean distance as the similarity metric. The quality of the resulting clusters was evaluated with the adjusted Rand index. The results show that k-means clustering and SOMs have distinct advantages over hierarchical clustering for gene clustering, and that SOMs perform slightly better than k-means when randomly initialized. They also show that hierarchical clustering works best with the Pearson correlation coefficient as the similarity metric and with row-normalized datasets, whereas k-means clustering and SOMs produce better clusters with Euclidean distance and logarithm-transformed datasets. These results provide a useful reference for implementing gene expression cluster analysis.
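
The building blocks named in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's implementation: the record does not define "normalization", so z-score scaling (zero mean, unit variance) is assumed; the Pearson-based dissimilarity is assumed to be 1 - r; and all function names are hypothetical. The clustering algorithms themselves are omitted.

```python
import math
from collections import Counter


def log2_transform(matrix):
    """Base-2 logarithm of every (strictly positive) expression value."""
    return [[math.log2(v) for v in row] for row in matrix]


def normalize_rows(matrix):
    """Assumed z-score scaling: each row gets zero mean and unit variance."""
    out = []
    for row in matrix:
        mean = sum(row) / len(row)
        sd = math.sqrt(sum((v - mean) ** 2 for v in row) / len(row))
        # A constant row has sd == 0; map it to zeros instead of dividing.
        out.append([(v - mean) / sd if sd else 0.0 for v in row])
    return out


def normalize_columns(matrix):
    """Same scaling applied per column, done by transposing twice."""
    transposed = [list(col) for col in zip(*matrix)]
    return [list(row) for row in zip(*normalize_rows(transposed))]


def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """Assumed dissimilarity 1 - r; undefined for constant profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def _pairs(k):
    """Number of unordered pairs among k items."""
    return k * (k - 1) // 2


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between a reference and a predicted partition."""
    n = len(labels_true)
    if n < 2:
        return 1.0
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(_pairs(c) for c in cells.values())
    sum_true = sum(_pairs(c) for c in Counter(labels_true).values())
    sum_pred = sum(_pairs(c) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / _pairs(n)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)
```

The adjusted Rand index depends only on which items are grouped together, so relabeling the clusters of a partition does not change its score, and it can fall below zero when agreement is worse than expected by chance.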
Pages: 607-613
Number of pages: 7