Mass of short texts clustering and topic extraction based on frequent itemsets

Authors
Peng, Min [1, 2]
Huang, Jiajia [1 ]
Zhu, Jiahui [3 ]
Huang, Jimin [1 ]
Liu, Jiping [1 ]
Affiliations
[1] Computer School, Wuhan University, Wuhan, 430072, China
[2] Shenzhen Research, Wuhan University, Shenzhen, Guangdong, 518057, China
[3] State Key Laboratory of Software Engineering (Wuhan University), Wuhan, 430072, China
DOI
10.7544/issn1000-1239.2015.20140533
Abstract
Short texts generated in social media are characterized by volume, velocity, low quality, and variety, which confronts vector-space-based clustering methods with high dimensionality, feature sparsity, and noise. In this paper, we propose a short text clustering and topic extraction (STC-TE) framework based on frequent itemsets mined from the texts. The framework first studies the impact of multiple features on short text quality. Then, a large number of frequent itemsets are mined from the high-quality short text set by setting a low support level, and a similar-itemset filtering strategy is devised to discard most of the unimportant frequent itemsets. Furthermore, based on frequent itemset similarity evaluated through the relevant texts, we propose a cluster self-adaptive spectral clustering (CSA_SC) algorithm that groups the itemsets into topic clusters. Finally, the large-scale short texts are classified into the associated clusters according to the topic words extracted from the frequent itemset clusters. The framework is tested on a Sina Weibo dataset of one million short texts to evaluate the selection and clustering of important frequent itemsets, the extraction of topic words, and the classification of large-scale short texts. Experimental results show that the STC-TE framework achieves topic extraction and large-scale short text clustering with high accuracy. © 2015, Science Press. All rights reserved.
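The first two steps of the pipeline (mining frequent itemsets at a low support level, then comparing itemsets by the texts that contain them) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy corpus, the brute-force enumeration (a real system would use Apriori or FP-growth), and the use of Jaccard overlap as the itemset similarity are all assumptions made for illustration. The abstract does not specify the actual similarity measure or the CSA_SC algorithm.

```python
from collections import defaultdict
from itertools import combinations


def frequent_itemsets(texts, min_support=2, max_size=2):
    """Map each frequent itemset (frozenset of words) to the ids of texts containing it.

    Hypothetical helper: brute-force enumeration, suitable only for tiny inputs.
    """
    cover = defaultdict(set)
    for doc_id, words in enumerate(texts):
        unique = sorted(set(words))
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                cover[frozenset(combo)].add(doc_id)
    # Keep only itemsets supported by at least min_support texts.
    return {s: ids for s, ids in cover.items() if len(ids) >= min_support}


def itemset_similarity(ids_a, ids_b):
    """Jaccard overlap of the supporting text sets (an assumed similarity measure)."""
    union = ids_a | ids_b
    return len(ids_a & ids_b) / len(union) if union else 0.0


# Toy corpus of tokenized short texts (made up for this example).
texts = [
    ["earthquake", "rescue", "team"],
    ["earthquake", "rescue", "donation"],
    ["movie", "premiere", "ticket"],
    ["movie", "ticket", "discount"],
]

itemsets = frequent_itemsets(texts, min_support=2, max_size=2)
for itemset, ids in sorted(itemsets.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), sorted(ids))
```

A pairwise similarity matrix built with `itemset_similarity` could then serve as the affinity input to a spectral clustering step, and itemsets whose supporting texts almost coincide would be candidates for the filtering step the abstract describes.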
Pages: 1941-1953