Data mining in large databases using domain generalization graphs

Cited by: 20
Authors
Hilderman, RJ [1]
Hamilton, HJ
Cercone, N
Affiliations
[1] Univ Regina, Dept Comp Sci, Regina, SK S4S 0A2, Canada
[2] Univ Waterloo, Fac Math, Dept Comp Sci, Waterloo, ON N2L 3G1, Canada
Funding
Natural Sciences and Engineering Research Council of Canada
Keywords
data mining; knowledge discovery; machine learning; knowledge representation; attribute-oriented generalization; domain generalization graphs
DOI
10.1023/A:1008769516670
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Attribute-oriented generalization summarizes the information in a relational database by repeatedly replacing specific attribute values with more general concepts according to user-defined concept hierarchies. We introduce domain generalization graphs (DGGs) for controlling the generalization of a set of attributes and show how they are constructed. We then present serial and parallel versions of the Multi-Attribute Generalization algorithm for traversing the generalization state space described by joining the domain generalization graphs for multiple attributes. Based upon a generate-and-test approach, the algorithm generates all possible summaries consistent with the domain generalization graphs. Our experimental results show that significant speedups are possible by partitioning path combinations from the DGGs across multiple processors. We also rank the interestingness of the resulting summaries using measures based upon variance and relative entropy. Our experimental results also show that these measures provide an effective basis for analyzing summary data generated from relational databases. Variance appears more useful because it tends to rank the less complex summaries (i.e., those with few attributes and/or tuples) as more interesting.
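The generate-and-test idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' Multi-Attribute Generalization algorithm: the toy relation, the two chain-shaped generalization hierarchies (standing in for domain generalization graphs), the function names, and the exact variance formula (spread of tuple proportions around a uniform distribution) are all assumptions made for illustration only.

```python
from collections import Counter
from itertools import product

# Hypothetical toy relation of (city, age) tuples; not data from the paper.
ROWS = [
    ("Regina", 23), ("Regina", 27), ("Saskatoon", 41),
    ("Waterloo", 35), ("Toronto", 62), ("Toronto", 29),
]

CITY_TO_PROVINCE = {
    "Regina": "SK", "Saskatoon": "SK", "Waterloo": "ON", "Toronto": "ON",
}

# Assumed chain-shaped hierarchies, ordered from most specific to most general.
# A real domain generalization graph may branch; a chain is the simplest case.
CITY_LEVELS = {
    "city": lambda v: v,
    "province": lambda v: CITY_TO_PROVINCE[v],
    "ANY": lambda v: "ANY",
}
AGE_LEVELS = {
    "age": lambda v: v,
    "decade": lambda v: f"{v // 10 * 10}s",
    "ANY": lambda v: "ANY",
}


def summarize(rows, city_fn, age_fn):
    """Generalize every tuple, then merge identical tuples and count them."""
    return Counter((city_fn(city), age_fn(age)) for city, age in rows)


def all_summaries(rows):
    """Generate one summary per combination of generalization levels."""
    return {
        (city_name, age_name): summarize(rows, city_fn, age_fn)
        for (city_name, city_fn), (age_name, age_fn)
        in product(CITY_LEVELS.items(), AGE_LEVELS.items())
    }


def variance_from_uniform(summary):
    """Assumed measure: sample variance of tuple proportions around 1/m."""
    m = len(summary)
    if m < 2:
        return 0.0
    total = sum(summary.values())
    uniform = 1 / m
    return sum((n / total - uniform) ** 2 for n in summary.values()) / (m - 1)


summaries = all_summaries(ROWS)

# Rank level combinations from highest to lowest variance score.
ranked = sorted(
    summaries,
    key=lambda key: variance_from_uniform(summaries[key]),
    reverse=True,
)

for key in ranked:
    print(key, len(summaries[key]), round(variance_from_uniform(summaries[key]), 4))
```

Each key in `summaries` names one node of the joined generalization space, so the level combinations could in principle be split across worker processes, which is the kind of partitioning the abstract credits for its parallel speedups. The sketch stays serial for brevity.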
Pages: 195-234
Number of pages: 40