Mass of short texts clustering and topic extraction based on frequent itemsets

Cited by: 0
Authors
Peng, Min [1, 2]
Huang, Jiajia [1]
Zhu, Jiahui [3]
Huang, Jimin [1]
Liu, Jiping [1]
Affiliations
[1] Computer School, Wuhan University, Wuhan, 430072, China
[2] Shenzhen Research, Wuhan University, Shenzhen, Guangdong, 518057, China
[3] State Key Laboratory of Software Engineering (Wuhan University), Wuhan, 430072, China
DOI
10.7544/issn1000-1239.2015.20140533
Abstract
Short texts generated in social media are characterized by high volume, high velocity, low quality, and variety, which confronts vector-space-based clustering methods with high dimensionality, feature sparsity, and noise. In this paper, we propose a short text clustering and topic extraction (STC-TE) framework based on the frequent itemsets mined from the texts. The framework first studies the impact of multiple features on the quality of short texts. Then, a large number of frequent itemsets are mined from the high-quality short text set by setting a low support level, and a similar-itemset filtering strategy is devised to discard most of the unimportant frequent itemsets. Furthermore, based on the similarity between frequent itemsets, evaluated through their relevant texts, we propose a cluster self-adaptive spectral clustering (CSA_SC) algorithm to group the itemsets into different topic clusters. Finally, the large-scale short texts are classified into their associated clusters according to the topic words extracted from the frequent itemset clusters. The framework is tested on one million Sina Weibo short texts to evaluate the selection and clustering of important frequent itemsets, the extraction of topic words, and the classification of large-scale short texts. Experimental results show that the STC-TE framework achieves topic extraction and large-scale short text clustering with high accuracy. © 2015, Science Press. All rights reserved.
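As a rough illustration of the first two ideas in the abstract (mining frequent itemsets with a low support threshold, and comparing itemsets through the texts that contain them), the following is a minimal sketch and not the authors' implementation. The function names `frequent_itemsets` and `jaccard`, the toy documents, the `max_size` limit, and the choice of Jaccard overlap as the similarity measure are all assumptions made for this example; the abstract does not specify the mining algorithm or the similarity formula, and the CSA_SC clustering step is not reproduced here.

```python
from itertools import combinations


def frequent_itemsets(docs, min_support, max_size=2):
    """Map each frequent itemset to the set of document indices containing it.

    Brute-force enumeration, suitable only for tiny examples; a real system
    would use a dedicated miner such as Apriori or FP-growth.
    """
    cover = {}
    for idx, tokens in enumerate(docs):
        unique = sorted(set(tokens))
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                cover.setdefault(frozenset(combo), set()).add(idx)
    # Keep only itemsets supported by at least min_support documents.
    return {items: ids for items, ids in cover.items() if len(ids) >= min_support}


def jaccard(a, b):
    """Jaccard overlap of two sets of document indices (assumed similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical tokenized short texts.
docs = [
    ["earthquake", "sichuan", "rescue"],
    ["earthquake", "rescue", "team"],
    ["earthquake", "sichuan"],
    ["football", "match", "goal"],
    ["football", "goal"],
]

itemsets = frequent_itemsets(docs, min_support=2)

quake_a = itemsets[frozenset({"earthquake", "sichuan"})]
quake_b = itemsets[frozenset({"earthquake", "rescue"})]
sport = itemsets[frozenset({"football", "goal"})]

same_topic = jaccard(quake_a, quake_b)   # itemsets sharing supporting texts
cross_topic = jaccard(quake_a, sport)    # itemsets with disjoint supporting texts
```

In this toy data, itemsets about the same event share supporting texts and therefore score higher than itemsets from unrelated topics; a pairwise similarity matrix built this way is the kind of input a spectral clustering step could consume.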
Pages: 1941-1953
Related papers
  • [1] Topic extraction by clustering word embeddings on short online texts
    Nabergoj, David
    D’Alconzo, Alessandro
    Valerio, Danilo
    Štrumbelj, Erik
    Elektrotehniski Vestnik/Electrotechnical Review, 2022, 89 (1-2): 64-72
  • [2] Topic extraction by clustering word embeddings on short online texts
    Nabergoj, David
    D'Alconzo, Alessandro
    Valerio, Danilo
    Strumbelj, Erik
    ELEKTROTEHNISKI VESTNIK, 2022, 89 (1-2): 64-72
  • [3] Clustering Frequent Itemsets Based on Generators
    Li, Jinhong
    Yang, Bingru
    Song, Wei
    Hou, Wei
    2008 INTERNATIONAL SYMPOSIUM ON INTELLIGENT INFORMATION TECHNOLOGY APPLICATION, VOL II, PROCEEDINGS, 2008: 1083-+
  • [4] An Intention-Topic Model Based on Verbs Clustering and Short Texts Topic Mining
    Lu, Tingting
    Hou, Shifeng
    Chen, Zhenxiang
    Cui, Lizhen
    Zhang, Lei
    CIT/IUCC/DASC/PICOM 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION TECHNOLOGY - UBIQUITOUS COMPUTING AND COMMUNICATIONS - DEPENDABLE, AUTONOMIC AND SECURE COMPUTING - PERVASIVE INTELLIGENCE AND COMPUTING, 2015: 837-842
  • [5] Enhanced Frequent Itemsets Based on Topic Modeling in Information Filtering
    Than Than Wai
    Aung, Sint Sint
    2017 16TH IEEE/ACIS INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION SCIENCE (ICIS 2017), 2017: 155-160
  • [6] Clustering categorical data based on maximal frequent itemsets
    Yu, Dadong
    Liu, Dongbo
    Luo, Rui
    Wang, Jianxin
    ICMLA 2007: SIXTH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS, PROCEEDINGS, 2007: 93-+
  • [7] Clustering Transactions Based on Weighting Maximal Frequent Itemsets
    Huang, Faliang
    Xie, Guoqing
    Yao, Zhiqiang
    Cai, Shengzhen
    2008 3RD INTERNATIONAL CONFERENCE ON INTELLIGENT SYSTEM AND KNOWLEDGE ENGINEERING, VOLS 1 AND 2, 2008: 262-+
  • [8] Text clustering using frequent itemsets
    Zhang, Wen
    Yoshida, Taketoshi
    Tang, Xijin
    Wang, Qing
    KNOWLEDGE-BASED SYSTEMS, 2010, 23 (05): 379-388
  • [9] A Multilevel Clustering Model for Coherent Topic Discovery in Short Texts
    Maithya, Emmanuel Muthoka
    Nderu, Lawrence
    Njagi, Dennis
    2022 IST-AFRICA CONFERENCE, 2022
  • [10] Hierarchical document clustering using frequent itemsets
    Fung, BCM
    Wang, K
    Ester, M
    PROCEEDINGS OF THE THIRD SIAM INTERNATIONAL CONFERENCE ON DATA MINING, 2003: 59-70