A scalable algorithm for clustering sequential data

Cited by: 49
Authors
Guralnik, V [1]
Karypis, G [1]
Affiliations
[1] Univ Minnesota, Dept Comp Sci, Minneapolis, MN 55455 USA
DOI
10.1109/ICDM.2001.989516
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In recent years, we have seen an enormous growth in the amount of available commercial and scientific data. Data from domains such as protein sequences, retail transactions, intrusion detection, and web logs have an inherently sequential nature. Clustering such data sets is useful for various purposes. For example, clustering sequences from commercial data sets may help marketers identify different customer groups based on their purchasing patterns. Grouping protein sequences that share similar structure helps in identifying sequences with similar functionality. Over the years, many methods have been developed for clustering objects according to their similarity. However, these methods tend to have a computational complexity that is at least quadratic in the number of sequences. In this paper we present an entirely different approach to sequence clustering that does not require an all-against-all analysis and uses a near-linear complexity K-means based clustering algorithm. Our experiments using data sets derived from sequences of purchasing transactions and protein sequences show that this approach is scalable and leads to reasonably good clusters.
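The abstract's central idea, comparing each sequence only to a small set of cluster centroids rather than to every other sequence, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how sequences are turned into feature vectors, so the contiguous length-k subsequence counts used here, the cosine-based assignment, and all function names (`kmer_features`, `kmeans_sequences`, and so on) are assumptions made purely for illustration.

```python
from collections import Counter
import math


def kmer_features(seq, k=2):
    """Unit-length sparse vector of contiguous length-k subsequence counts.

    Assumption: k-mer counts stand in for whatever features the paper uses.
    """
    counts = Counter(tuple(seq[i:i + k]) for i in range(len(seq) - k + 1))
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {f: v / norm for f, v in counts.items()} if norm else {}


def cosine(u, v):
    """Dot product of two sparse vectors (equals cosine for unit vectors)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(f, 0.0) for f, w in u.items())


def centroid(vectors):
    """Normalized sum of the member vectors of one cluster."""
    total = Counter()
    for vec in vectors:
        for f, w in vec.items():
            total[f] += w
    norm = math.sqrt(sum(w * w for w in total.values()))
    return {f: w / norm for f, w in total.items()} if norm else {}


def kmeans_sequences(sequences, n_clusters, k=2, n_iter=20, seeds=None):
    """Cluster sequences with K-means over k-mer vectors.

    Each pass compares every sequence to n_clusters centroids only, so there
    is no all-against-all comparison between sequences.
    """
    vectors = [kmer_features(s, k) for s in sequences]
    # Assumption: the first n_clusters sequences seed the centroids by default.
    seeds = list(range(n_clusters)) if seeds is None else seeds
    centroids = [vectors[i] for i in seeds]
    labels = None
    for _ in range(n_iter):
        new_labels = [
            max(range(n_clusters), key=lambda c: cosine(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(n_clusters):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = centroid(members)
    return labels


labels = kmeans_sequences(["ababab", "cdcdcd", "abab", "cdcd"], n_clusters=2)
```

In this sketch the work per iteration grows with the number of sequences times the number of clusters, which is the sense in which a K-means style method avoids the quadratic cost of pairwise comparison.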
Pages: 179 - 186
Page count: 8
Related papers
50 in total
  • [31] A Scalable Exemplar-Based Subspace Clustering Algorithm for Class-Imbalanced Data
    You, Chong
    Li, Chi
    Robinson, Daniel P.
    Vidal, Rene
    COMPUTER VISION - ECCV 2018, PT IX, 2018, 11213 : 68 - 85
  • [32] DACE: a scalable DP-means algorithm for clustering extremely large sequence data
    Jiang, Linhao
    Dong, Yichao
    Chen, Ning
    Chen, Ting
    BIOINFORMATICS, 2017, 33 (06) : 834 - 842
  • [33] A Scalable Hierarchical Clustering Algorithm Using Spark
    Jin, Chen
    Liu, Ruoqian
    Hendrix, William
    Agrawal, Ankit
    Choudhary, Alok
    Chen, Zhengzhang
    2015 IEEE FIRST INTERNATIONAL CONFERENCE ON BIG DATA COMPUTING SERVICE AND APPLICATIONS (BIGDATASERVICE 2015), 2015 : 418 - U533
  • [34] A Scalable Clustering Algorithm for Serendipity in Recommender Systems
    Deshmukh, Anup Anand
    Nair, Pratheeksha
    Rao, Shrisha
    2018 18TH IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS (ICDMW), 2018 : 1279 - 1288
  • [35] HPC enabled a Novel Deep Fuzzy Scalable Clustering Algorithm and its Application for Protein Data
    Jha, Preeti
    Tiwari, Aruna
    Bharill, Neha
    Ratnaparkhe, Milind
    Patel, Om Prakash
    Anand, Vaibhav
    Arya, Sudhanshu
    Singh, Tanmay
    2022 IEEE CONFERENCE ON COMPUTATIONAL INTELLIGENCE IN BIOINFORMATICS AND COMPUTATIONAL BIOLOGY (IEEE CIBCB 2022), 2022 : 257 - 264
  • [36] Scalable decision fusion algorithm for enabling decentralized computation in distributed, big data clustering problems
    Jennath, H. S.
    Asharaf, S.
    INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2024, 15 (09) : 3803 - 3827
  • [37] Sequential clustering algorithm for Gaussian mixture initialization
    Messina, R
    Jouvet, D
    2004 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS: SPEECH PROCESSING, 2004 : 833 - 836
  • [38] Improved sequential IB algorithm for document clustering
    Ye, Yang-Dong
    Zhang, Jie
    Liu, Dong
    Moshi Shibie yu Rengong Zhineng/Pattern Recognition and Artificial Intelligence, 2008, 21 (03) : 417 - 423
  • [39] Scalable parallel clustering for data mining on multicomputers
    Foti, D
    Lipari, D
    Pizzuti, C
    Talia, D
    PARALLEL AND DISTRIBUTED PROCESSING, PROCEEDINGS, 2000, 1800 : 390 - 398
  • [40] Scalable Active Constrained Clustering for Temporal Data
    Mai, Son T.
    Amer-Yahia, Sihem
    Chouakria, Ahlame Douzal
    Nguyen, Ky T.
    Anh-Duong Nguyen
    DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, DASFAA 2018, PT I, 2018, 10827 : 566 - 582