Efficient phrase-based document indexing for web document clustering

Cited by: 163
Authors
Hammouda, KM [1]
Kamel, MS [1]
Affiliations
[1] Univ Waterloo, Dept Syst Design Engn, Waterloo, ON N2L 3G1, Canada
Funding
Natural Sciences and Engineering Research Council of Canada
Keywords
Web mining; document similarity; phrase-based indexing; document clustering; document structure; document index graph; phrase matching
DOI
10.1109/TKDE.2004.58
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Document clustering techniques mostly rely on single-term analysis of the document data set, as in the Vector Space Model. More accurate clustering requires more informative features, in particular phrases and their weights. Document clustering is useful in many applications, such as automatic categorization of documents, grouping of search engine results, and building a taxonomy of documents. This paper presents two key parts of successful document clustering. The first is a novel phrase-based document index model, the Document Index Graph, which allows incremental construction of a phrase-based index of the document set with an emphasis on efficiency, rather than relying on single-term indexes only. It provides efficient phrase matching, which is used to judge the similarity between documents. The model is flexible in that it can revert to a compact representation of the vector space model if phrases are not indexed. The second is an incremental document clustering algorithm that maximizes the tightness of clusters by carefully watching the pair-wise document similarity distribution inside clusters. Together, these two components form an underlying model for robust and accurate document similarity calculation that leads to much improved results in Web document clustering over traditional methods.
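The abstract describes an index that is built incrementally and that reports phrase matches against earlier documents as each new document arrives. The Python sketch below illustrates that general idea only; it is not the paper's Document Index Graph. The class name `PhraseIndex`, the word-pair edge table, and the whitespace tokenization are all assumptions made for illustration.

```python
from collections import defaultdict


class PhraseIndex:
    """Toy incremental phrase index (illustrative; not the paper's structure)."""

    def __init__(self):
        # (word, next_word) -> doc_id -> list of (sentence_no, position of word)
        self.edges = defaultdict(lambda: defaultdict(list))

    def add_document(self, doc_id, sentences):
        """Index one document and return phrases shared with earlier documents."""
        shared = defaultdict(set)  # other doc_id -> set of shared phrases (word tuples)
        for s_no, sentence in enumerate(sentences):
            words = sentence.lower().split()  # assumption: naive whitespace tokens
            # Matches still being extended:
            # (other doc, its sentence, its position) -> start index in `words`
            active = {}
            for i in range(len(words) - 1):
                edge = (words[i], words[i + 1])
                next_active = {}
                for other, hits in self.edges.get(edge, {}).items():
                    if other == doc_id:
                        continue  # ignore matches inside the same document
                    for o_s, o_pos in hits:
                        # Extend a match that ended one position earlier, else start anew.
                        start = active.get((other, o_s, o_pos - 1), i)
                        next_active[(other, o_s, o_pos)] = start
                        shared[other].add(tuple(words[start:i + 2]))
                active = next_active
            # Index this sentence only after matching it against earlier content.
            for i in range(len(words) - 1):
                self.edges[(words[i], words[i + 1])][doc_id].append((s_no, i))
        return dict(shared)


index = PhraseIndex()
index.add_document("d1", ["mild river rafting trips"])
matches = index.add_document("d2", ["booking river rafting trips online"])
print(matches)
```

In this sketch a phrase is any run of two or more consecutive words, and every growing prefix of a match is recorded, so a three-word match also yields its two-word prefix. A similarity measure could then weight the shared phrases by length or frequency; that weighting is left out here because the abstract does not specify it.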
Pages: 1279-1296
Page count: 18