Performance comparison of gene family clustering methods with expert curated gene family data set in Arabidopsis thaliana

Cited by: 4
Authors
Yang, Kuan [2, 3]
Zhang, Liqing [1, 3]
Affiliations
[1] Virginia Tech, Dept Comp Sci, Blacksburg, VA 24061 USA
[2] Virginia Tech, Virginia Bioinformat Inst, Blacksburg, VA 24061 USA
[3] Virginia Tech, Program Genet Bioinformat & Computat Biol, Blacksburg, VA 24061 USA
Funding
US National Science Foundation
Keywords
Arabidopsis; complete linkage; gene family; hierarchical clustering algorithm; K-means clustering; single linkage; TribeMCL
DOI
10.1007/s00425-008-0748-7
Chinese Library Classification number
Q94 [Botany]
Discipline classification code
071001
Abstract
With the exponential growth of genomics data, the demand for reliable clustering methods is increasing every day. Despite the wide usage of many clustering algorithms, the accuracy of these algorithms has been evaluated mostly on simulated data sets and seldom on real biological data for which a "correct answer" is available. In order to address this issue, we use the manually curated high-quality Arabidopsis thaliana gene family database as a "gold standard" to conduct a comprehensive comparison of the accuracies of four widely used clustering methods including K-means, TribeMCL, single-linkage clustering and complete-linkage clustering. We compare the results from running different clustering methods on two matrices: the E-value matrix and the k-tuple distance matrix. The E-value matrix is computed based on BLAST E-values. The k-tuple distance matrix is computed based on the difference in tuple frequencies. The TribeMCL with the E-value matrix performed best, with the Inflation parameter (=1.15) tuned considerably lower than what has been suggested previously (=2). The single-linkage clustering method with the E-value matrix was second best. Single-linkage clustering, K-means clustering, complete-linkage clustering, and TribeMCL with a k-tuple distance matrix performed reasonably well. Complete-linkage clustering with the k-tuple distance matrix performed the worst.
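As a rough illustration of the k-tuple distance matrix and the two linkage rules named in the abstract, the sketch below uses only the Python standard library. The abstract says only that the distance is "based on the difference in tuple frequencies", so the exact formula here (sum of squared frequency differences over overlapping k-tuples) is an assumption. The toy sequences, the threshold, and the function names `ktuple_freqs`, `ktuple_distance`, and `cluster` are likewise hypothetical and not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def ktuple_freqs(seq, k=2):
    """Relative frequencies of the overlapping k-tuples in a sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values()) or 1  # guard against sequences shorter than k
    return {t: c / total for t, c in counts.items()}


def ktuple_distance(a, b, k=2):
    """Sum of squared k-tuple frequency differences (assumed definition)."""
    fa, fb = ktuple_freqs(a, k), ktuple_freqs(b, k)
    return sum((fa.get(t, 0.0) - fb.get(t, 0.0)) ** 2 for t in set(fa) | set(fb))


def cluster(names, dist, threshold, linkage="single"):
    """Naive agglomerative clustering over a pairwise distance table.

    Single linkage scores two clusters by their closest members,
    complete linkage by their farthest members. Merging stops once the
    best-scoring pair of clusters is farther apart than `threshold`.
    """
    pick = min if linkage == "single" else max
    clusters = [[n] for n in names]

    def gap(c1, c2):
        return pick(dist[frozenset((x, y))] for x in c1 for y in c2)

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: gap(clusters[p[0]], clusters[p[1]]))
        if gap(clusters[i], clusters[j]) > threshold:
            break
        clusters[i] += clusters.pop(j)  # i < j, so index i is unaffected
    return sorted(sorted(c) for c in clusters)


# Toy sequences (hypothetical, not from the curated Arabidopsis data set).
seqs = {
    "a1": "MKVLAAGIVGL",
    "a2": "MKVLAAGIVAL",
    "b1": "QQPPQQPPQQP",
    "b2": "QQPPQQPPQQA",
}
dist = {frozenset((x, y)): ktuple_distance(seqs[x], seqs[y])
        for x, y in combinations(seqs, 2)}
families = cluster(list(seqs), dist, threshold=0.1, linkage="single")
print(families)
```

Passing `linkage="complete"` to the same function switches the merge criterion. On chain-like data, single linkage tends to join clusters through intermediate members, whereas complete linkage keeps them apart. That difference is one reason the two methods can score differently against a curated reference.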
Pages: 439-447
Page count: 9
Related papers
  • [1] Performance comparison of gene family clustering methods with expert curated gene family data set in Arabidopsis thaliana
    Kuan Yang
    Liqing Zhang
    Planta, 2008, 228: 439-447
  • [2] Performance Comparison of Clustering Methods for Gene Family Data
    Dan Wei
    Qingshan Jiang
    Frontiers in Computer Education, 2012, 133: 827 - +
  • [3] The carboxylesterase gene family from Arabidopsis thaliana
    Sean D. G. Marshall
    Joanna J. Putterill
    Kim M. Plummer
    Richard D. Newcomb
    Journal of Molecular Evolution, 2003, 57 (05): 487-500
  • [4] WRKY gene family evolution in Arabidopsis thaliana
    Qishan Wang
    Minghui Wang
    Xiangzhe Zhang
    Boji Hao
    S. K. Kaushik
    Yuchun Pan
    Genetica, 2011, 139 (08): 973-983
  • [5] Expression of the Arabidopsis thaliana invertase gene family
    Zuzanna Tymowska-Lalanne
    Martin Kreis
    Planta, 1998, 207 (02): 259-265
  • [6] Cytokinin regulation of gene expression in the AHP gene family in Arabidopsis thaliana
    Jana Hradilová
    Jiří Malbeck
    Břetislav Brzobohatý
    Journal of Plant Growth Regulation, 2007, 26 (03): 229-244