Mass of short texts clustering and topic extraction based on frequent itemsets

Cited by: 0
Authors
Peng, Min [1, 2]
Huang, Jiajia [1 ]
Zhu, Jiahui [3 ]
Huang, Jimin [1 ]
Liu, Jiping [1 ]
Affiliations
[1] Computer School, Wuhan University, Wuhan 430072, China
[2] Shenzhen Research, Wuhan University, Shenzhen, Guangdong 518057, China
[3] State Key Laboratory of Software Engineering (Wuhan University), Wuhan 430072, China
DOI
10.7544/issn1000-1239.2015.20140533
Abstract
Short texts generated in social media are characterized by high volume, high velocity, low quality and great variety, so vector-space-based clustering methods face the challenges of high dimensionality, feature sparsity and noise. In this paper, we propose a short text clustering and topic extraction (STC-TE) framework based on frequent itemsets mined from the texts. The framework first studies the impact of multiple features on the quality of short texts. Then, a large number of frequent itemsets are mined from the high-quality short text set by setting a low support level, and a similar-itemset filtering strategy is devised to discard most of the unimportant frequent itemsets. Next, based on the similarity between frequent itemsets as evaluated from their relevant texts, we propose a cluster self-adaptive spectral clustering (CSA_SC) algorithm to group the itemsets into different topic clusters. Finally, the large-scale short texts are classified into the associated clusters according to the topic words extracted from the frequent itemset clusters. The framework is tested on a SinaWeibo dataset of one million texts to evaluate the performance of important frequent itemset selection and clustering, topic word extraction, and large-scale short text classification. Experimental results show that the STC-TE framework achieves topic extraction and large-scale short text clustering with high accuracy. © 2015, Science Press. All rights reserved.
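The first two steps of the abstract (mining frequent itemsets at a low support level, then comparing itemsets through the texts that contain them) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the whitespace tokenization, the cap on itemset size, and the use of Jaccard overlap as the similarity measure are all assumptions made for illustration, and the spectral clustering step (CSA_SC) is not reproduced here.

```python
from collections import defaultdict
from itertools import combinations


def frequent_itemsets(texts, min_support, max_size=2):
    """Map each frequent itemset (a frozenset of words) to the indices
    of the texts that contain all of its words.

    Hypothetical helper: brute-force enumeration, suitable only for
    tiny examples, not for a million-text corpus.
    """
    support = defaultdict(set)
    for idx, text in enumerate(texts):
        words = sorted(set(text.split()))  # assumed whitespace tokenization
        for size in range(1, max_size + 1):
            for combo in combinations(words, size):
                support[frozenset(combo)].add(idx)
    # Keep only itemsets supported by at least `min_support` texts.
    return {items: ids for items, ids in support.items()
            if len(ids) >= min_support}


def itemset_similarity(ids_a, ids_b):
    """Jaccard overlap of two supporting-text sets (an assumed measure)."""
    union = ids_a | ids_b
    return len(ids_a & ids_b) / len(union) if union else 0.0


texts = [
    "rain flood city",
    "flood city rescue",
    "match goal team",
    "team goal win",
]
itemsets = frequent_itemsets(texts, min_support=2)
flood_city = itemsets[frozenset({"city", "flood"})]
goal_team = itemsets[frozenset({"goal", "team"})]
print(sorted(flood_city), sorted(goal_team))
print(itemset_similarity(flood_city, goal_team))
```

In this toy corpus the itemsets {city, flood} and {goal, team} are supported by disjoint texts, so their similarity is zero. A pairwise similarity matrix built this way is the kind of input a spectral clustering step would take.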
Pages: 1941 - 1953
Related papers
  • [21] Research on the distributed treatment of frequent itemsets extraction based on pruned concept lattices
    Xu, Yong
    Zhou, Sen-Xin
    PROCEEDINGS OF 2006 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-7, 2006, : 1332 - +
  • [22] Mining quantitative frequent itemsets using adaptive density-based subspace clustering
    Washio, T
    Mitsunaga, Y
    Motoda, H
    FIFTH IEEE INTERNATIONAL CONFERENCE ON DATA MINING, PROCEEDINGS, 2005, : 793 - 796
  • [23] Mining updated frequent itemsets based on directed itemsets graph
    Wen Lei
    Li Min-qiang
    Proceedings of 2004 Chinese Control and Decision Conference, 2004, : 690 - 693
  • [24] Approximate Frequent Itemsets Compression Using Dynamic Clustering Method
    Yan, Hua
    Sang, Yongsheng
    2008 IEEE CONFERENCE ON CYBERNETICS AND INTELLIGENT SYSTEMS, VOLS 1 AND 2, 2008, : 1110 - 1115
  • [25] Mining maximum frequent itemsets based on directed itemsets graph
    Wen Lei
    PROCEEDINGS OF 2004 CHINESE CONTROL AND DECISION CONFERENCE, 2004, : 681 - 683
  • [26] Domain-Oriented Topic Discovery Based on Features Extraction and Topic Clustering
    Lu, Xiaofeng
    Zhou, Xiao
    Wang, Wenting
    Lio, Pietro
    Hui, Pan
    IEEE ACCESS, 2020, 8 (08) : 93648 - 93662
  • [27] Topic Based Temporal Generative Short Text Clustering
    Smitha, E. S.
    Sendhilkumar, S.
    Mahalakshmi, G. S.
    Sanju, S. Krithika
    PROCEEDING OF THE INTERNATIONAL CONFERENCE ON COMPUTER NETWORKS, BIG DATA AND IOT (ICCBI-2018), 2020, 31 : 912 - 922
  • [28] Attention-based Autoencoder Topic Model for Short Texts
    Tian, Tian
    Fang, Zheng
    10TH INTERNATIONAL CONFERENCE ON AMBIENT SYSTEMS, NETWORKS AND TECHNOLOGIES (ANT 2019) / THE 2ND INTERNATIONAL CONFERENCE ON EMERGING DATA AND INDUSTRY 4.0 (EDI40 2019) / AFFILIATED WORKSHOPS, 2019, 151 : 1134 - 1139
  • [29] The short texts classification based on neural network topic model
    Shao, Dangguo
    Li, Chengyao
    Huang, Chusheng
    An, Qing
    Xiang, Yan
    Guo, Junjun
    He, Jianfeng
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2022, 42 (03) : 2143 - 2155
  • [30] ASTM: An Attentional Segmentation based Topic Model for Short Texts
    Wang, Jiamiao
    Chen, Ling
    Qin, Lu
    Wu, Xindong
    2018 IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM), 2018, : 577 - 586