Classification of a large microarray data set: Algorithm comparison and analysis of drug signatures

Cited by: 74
Authors
Natsoulis, G [1]
El Ghaoui, L
Lanckriet, GRG
Tolley, AM
Leroy, F
Dunlea, S
Eynon, BP
Pearson, CI
Tugendreich, S
Jarnagin, K
Affiliations
[1] Iconix Pharmaceut, Mountain View, CA 94043 USA
[2] Univ Calif Berkeley, Dept Elect Engn & Comp Sci, Berkeley, CA 94720 USA
[3] SPSS, Chicago, IL 60606 USA
DOI
10.1101/gr.2807605
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Subject classification code
071010; 081704
Abstract
A large gene expression database has been produced that characterizes the gene expression and physiological effects of hundreds of approved and withdrawn drugs, toxicants, and biochemical standards in various organs of live rats. In order to derive useful biological knowledge from this large database, a variety of supervised classification algorithms were compared using a 597-microarray subset of the data. Our studies show that several types of linear classifiers based on Support Vector Machines (SVMs) and Logistic Regression can be used to derive readily interpretable drug signatures with high classification performance. Both methods can be tuned to produce classifiers of drug treatments in the form of short, weighted gene lists which, upon analysis, reveal that some of the signature genes have a positive contribution (act as "rewards" for the class-of-interest) while others have a negative contribution (act as "penalties") to the classification decision. The combination of reward and penalty genes enhances performance by keeping the number of false positive treatments low. The results of these algorithms are combined with feature selection techniques that further reduce the length of the drug signatures, an important step towards the development of useful diagnostic biomarkers and low-cost assays. Multiple signatures with no genes in common can be generated for the same classification end-point. Comparison of these gene lists identifies biological processes characteristic of a given class.
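To make the "reward" and "penalty" idea concrete, the following is a minimal sketch of how a short, weighted gene list can act as a linear classifier. It is an illustration only, not the authors' implementation: the gene names, weights, bias, and expression values are all invented, and the helper names `split_signature`, `decision_score`, and `classify` are hypothetical.

```python
# Hypothetical signature: gene names, weights, and bias are invented
# for illustration and do not come from the paper.
SIGNATURE = {"gene_A": 1.2, "gene_B": 0.7, "gene_C": -0.9, "gene_D": -0.4}
BIAS = -0.5


def split_signature(weights):
    """Separate positively weighted (reward) and negatively weighted (penalty) genes."""
    rewards = {g: w for g, w in weights.items() if w > 0}
    penalties = {g: w for g, w in weights.items() if w < 0}
    return rewards, penalties


def decision_score(expression, weights, bias):
    """Linear decision value: weighted sum of expression values plus a bias term.

    Genes missing from the expression profile are treated as 0.0.
    """
    return sum(w * expression.get(g, 0.0) for g, w in weights.items()) + bias


def classify(expression, weights, bias):
    """Assign the treatment to the class of interest when the score is positive."""
    return decision_score(expression, weights, bias) > 0


# Two made-up expression profiles. Both induce the reward genes, but the
# second also induces the penalty genes, which pushes its score below zero.
in_class = {"gene_A": 1.0, "gene_B": 1.0, "gene_C": 0.0, "gene_D": 0.0}
look_alike = {"gene_A": 1.0, "gene_B": 1.0, "gene_C": 1.5, "gene_D": 1.0}

rewards, penalties = split_signature(SIGNATURE)
print(sorted(rewards), sorted(penalties))
print(classify(in_class, SIGNATURE, BIAS), classify(look_alike, SIGNATURE, BIAS))
```

In this toy case the second profile would be a false positive if only the reward genes were scored; the penalty genes are what reject it, which is the role the abstract attributes to them.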
Pages: 724-736
Page count: 13