On efficiently summarizing categorical databases

Cited by: 29
Authors
Wang, JY
Karypis, G [1]
Affiliations
[1] Univ Minnesota, Digital Technol Ctr, Dept Comp Sci, Minneapolis, MN 55455 USA
[2] Univ Minnesota, Army HPC Res Ctr, Minneapolis, MN 55455 USA
[3] Tsinghua Univ, Dept Comp Sci & Technol, Beijing 100084, Peoples R China
Keywords
data mining; frequent itemset; categorical database; clustering
DOI
10.1007/s10115-005-0216-7
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Frequent itemset mining was initially proposed and has been studied extensively in the context of association rule mining. In recent years, several studies have also extended its application to transaction or document clustering. However, most frequent-itemset-based clustering algorithms must first mine a large intermediate set of frequent itemsets in order to identify a subset of the most promising ones that can be used for clustering. In this paper, we study how to directly find a subset of high-quality frequent itemsets that can be used as a concise summary of the transaction database and to cluster the categorical data. By exploring key properties of the subset of itemsets that we are interested in, we propose several search-space pruning methods and design an efficient algorithm called SUMMARY. Our empirical results show that SUMMARY runs very fast even when the minimum support is extremely low and scales very well with respect to the database size. Surprisingly, as a pure frequent itemset mining algorithm, it is also very effective in clustering the categorical data and summarizing dense transaction databases.
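To make the idea of using frequent itemsets as both a summary and a clustering concrete, here is a minimal sketch on toy data. It is not the SUMMARY algorithm: the abstract does not describe its pruning methods or its exact selection criterion, so this version enumerates all itemsets by brute force (the costly intermediate step the paper aims to avoid) and then, as an illustrative assumption, labels each transaction with the longest frequent itemset it contains. The function names `frequent_itemsets` and `summarize`, the tie-breaking rule, and the sample transactions are all hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration (toy scale only): map every itemset that
    occurs in at least `min_support` transactions to its support count."""
    counts = {}
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return {s: c for s, c in counts.items() if c >= min_support}


def summarize(transactions, min_support):
    """Assumed selection rule, for illustration: label each transaction
    with the longest frequent itemset it contains. Ties are broken by
    higher support, then by the sorted item list. Transactions that share
    a label form one cluster."""
    freq = frequent_itemsets(transactions, min_support)
    clusters = {}
    for idx, t in enumerate(transactions):
        tset = set(t)
        candidates = [s for s in freq if s <= tset]
        if not candidates:
            continue  # no frequent itemset covers this transaction
        best = max(candidates, key=lambda s: (len(s), freq[s], sorted(s)))
        clusters.setdefault(best, []).append(idx)
    return clusters


# Hypothetical toy database of categorical transactions.
transactions = [
    ["a", "b", "c"],
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["e", "f"],
    ["e", "f", "g"],
]
clusters = summarize(transactions, min_support=2)
for itemset, members in clusters.items():
    print(sorted(itemset), members)
```

The keys of `clusters` act as the concise summary and the values are the transaction indices grouped under each key. The brute-force enumeration grows exponentially with transaction length, which is why a direct search with pruning, as the abstract proposes, matters at low minimum support.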
Pages: 19-37
Page count: 19