On Text Clustering with Side Information

Cited by: 12
Authors
Aggarwal, Charu C. [1 ]
Zhao, Yuchen [2 ]
Yu, Philip S. [2 ]
Affiliations
[1] IBM TJ Watson Res Ctr, Hawthorne, NY 10532 USA
[2] Univ Illinois, Chicago, IL USA
DOI
10.1109/ICDE.2012.111
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Text clustering has become an increasingly important problem in recent years because of the tremendous amount of unstructured data which is available in various forms in online forums such as the web, social networks, and other information networks. In most cases, the data is not purely available in text form. A lot of side-information is available along with the text documents. Such side-information may be of different kinds, such as the links in the document, user-access behavior from web logs, or other non-textual attributes which are embedded into the text document. Such attributes may contain a tremendous amount of information for clustering purposes. However, the relative importance of this side-information may be difficult to estimate, especially when some of the information is noisy. In such cases, it can be risky to incorporate side-information into the clustering process, because it can either improve the quality of the representation for clustering, or can add noise to the process. Therefore, we need a principled way to perform the clustering process, so as to maximize the advantages from using this side information. In this paper, we design an algorithm which combines classical partitioning algorithms with probabilistic models in order to create an effective clustering approach. We present experimental results on a number of real data sets in order to illustrate the advantages of using such an approach.
Pages: 894 - 904
Page count: 11