Efficient C4.5

Cited by: 245
Authors
Ruggieri, S [1]
Affiliation
[1] Univ Pisa, Dipartimento Informat, I-56125 Pisa, Italy
Keywords
C4.5; decision trees; inductive learning; supervised learning; data mining
DOI
10.1109/69.991727
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We present an analytic evaluation of the runtime behavior of the C4.5 algorithm which highlights some efficiency improvements. Based on the analytic evaluation, we have implemented a more efficient version of the algorithm, called EC4.5. It improves on C4.5 by adopting the best among three strategies for computing the information gain of continuous attributes. All the strategies adopt a binary search of the threshold in the whole training set starting from the local threshold computed at a node. The first strategy computes the local threshold using the algorithm of C4.5, which, in particular, sorts cases by means of the quicksort method. The second strategy also uses the algorithm of C4.5, but adopts a counting sort method. The third strategy calculates the local threshold using a main-memory version of the RainForest algorithm, which does not need sorting. Our implementation computes the same decision trees as C4.5 with a performance gain of up to five times.
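The threshold step described in the abstract can be illustrated with a minimal Python sketch. It is not the EC4.5 implementation: the function names (`local_threshold`, `snap_to_training_value`) are hypothetical, the local threshold is taken as the midpoint between adjacent distinct values at the node, and C4.5 details such as gain ratio, minimum case counts, and missing values are left out. It corresponds roughly to the first strategy (sort, then scan), followed by a binary search over the sorted values of the whole training set.

```python
import bisect
import math
from collections import Counter


def entropy_from_counts(counts, n):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    return -sum((c / n) * math.log2(c / n) for c in counts.values() if c > 0)


def local_threshold(values, labels):
    """Best binary split of one continuous attribute at a node.

    Sorts the cases at the node, scans the candidate cut points once, and
    returns (information_gain, midpoint). Returns (0.0, None) when no split
    improves on the unsplit node.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total = Counter(labels)
    base = entropy_from_counts(total, n)
    left = Counter()
    best_gain, best_mid = 0.0, None
    for i in range(n - 1):
        left[pairs[i][1]] += 1
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # no cut point between equal attribute values
        n_left = i + 1
        n_right = n - n_left
        right = total - left  # Counter subtraction keeps positive counts only
        gain = (base
                - (n_left / n) * entropy_from_counts(left, n_left)
                - (n_right / n) * entropy_from_counts(right, n_right))
        if gain > best_gain:
            best_gain = gain
            best_mid = (pairs[i][0] + pairs[i + 1][0]) / 2
    return best_gain, best_mid


def snap_to_training_value(sorted_all_values, midpoint):
    """Binary search for the largest training-set value <= midpoint.

    `sorted_all_values` is assumed to hold the attribute values of the whole
    training set, sorted once up front.
    """
    idx = bisect.bisect_right(sorted_all_values, midpoint)
    return sorted_all_values[idx - 1]


# Cases that reached the current node (illustrative data).
node_values = [1.0, 2.0, 3.0, 4.0]
node_labels = ["a", "a", "b", "b"]
gain, mid = local_threshold(node_values, node_labels)

# Attribute values of the whole training set, sorted once.
all_values = sorted([0.5, 1.0, 2.0, 2.2, 3.0, 4.0])
threshold = snap_to_training_value(all_values, mid)
```

In this example the local midpoint is 2.5 and the binary search moves it to 2.2, the largest value in the whole training set that does not exceed it. Because the whole-set values are sorted only once, each lookup costs a logarithmic number of comparisons instead of a linear scan.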
Pages: 438-444
Number of pages: 7