RainForest - A framework for fast decision tree construction of large datasets

Cited by: 85
Authors
Gehrke, J [1]
Ramakrishnan, R [1]
Ganti, V [1]
Affiliation
[1] Univ Wisconsin, Dept Comp Sci, Madison, WI 53706 USA
Keywords
data mining; decision trees; classification; scalability
DOI
10.1023/A:1009839829793
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Classification of large datasets is an important data mining problem. Many classification algorithms have been proposed in the literature, but studies have shown that so far no algorithm uniformly outperforms all other algorithms in terms of quality. In this paper, we present a unifying framework called RainForest for classification tree construction that separates the scalability aspects of algorithms for constructing a tree from the central features that determine the quality of the tree. The generic algorithm is easy to instantiate with specific split selection methods from the literature (including C4.5, CART, CHAID, FACT, ID3 and extensions, SLIQ, SPRINT and QUEST). In addition to its generality, in that it yields scalable versions of a wide range of classification algorithms, our approach also offers performance improvements of over a factor of three over the SPRINT algorithm, the fastest scalable classification algorithm proposed previously. In contrast to SPRINT, however, our generic algorithm requires a certain minimum amount of main memory, proportional to the set of distinct values in a column of the input relation. Given current main memory costs, this requirement is readily met in most if not all workloads.
Pages: 127-162
Page count: 36