A novel weighted phrase-based similarity for Web documents clustering

Cited by: 2
Authors
Yang R. [1]
Zhu Q. [1]
Xia Y. [1]
Affiliation
[1] College of Computer Science, Chongqing University, Chongqing
Keywords
Document structure; Phrase-based similarity; Suffix tree; Web document clustering; Weight computing
DOI
10.4304/jsw.6.8.1521-1528
Abstract
Phrases are considered more informative feature terms than single words for improving the effectiveness of document clustering. This paper proposes a weighted phrase-based document similarity that computes the pairwise similarities of documents using the Weighted Suffix Tree Document (WSTD) model. The similarity is applied to the Group-average Hierarchical Agglomerative Clustering (GHAC) algorithm to develop a new Web document clustering approach. Based on the structure of the Web documents, different document parts are assigned different levels of significance. These levels are stored as structure weights in the nodes of the weighted suffix tree, which is constructed from sentences instead of whole documents. By mapping each node and its weights in the WSTD model to a unique feature term in the Vector Space Document (VSD) model, the new similarity naturally inherits the TF-IDF term weighting scheme when computing document similarity with weighted phrases. The evaluation experiments indicate that the new clustering approach is very effective at clustering Web documents. Its quality greatly surpasses that of the traditional phrase-based approach, which ignores the structure of Web documents. In conclusion, the weighted phrase-based similarity works much better than the ordinary phrase-based similarity. © 2011 ACADEMY PUBLISHER.
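The pipeline described in the abstract (structure-weighted phrases, TF-IDF weighting, vector-space similarity) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several parts are assumptions made for illustration only: word n-grams stand in for suffix tree nodes, the `STRUCTURE_WEIGHTS` values are hypothetical (the abstract gives none), the TF-IDF formula is one common variant, and cosine similarity is assumed as the vector-space measure.

```python
import math
from collections import Counter

# Hypothetical structure weights; the abstract does not state actual values.
STRUCTURE_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def phrases(sentence, max_len=3):
    """Yield every word n-gram (n <= max_len) of a sentence as a tuple.

    Assumption: n-grams approximate the phrases that suffix tree nodes
    would represent; a real suffix tree is not built here.
    """
    words = sentence.lower().split()
    for i in range(len(words)):
        for j in range(i + 1, min(i + max_len, len(words)) + 1):
            yield tuple(words[i:j])


def weighted_tf(doc):
    """Sum structure weights over phrase occurrences.

    `doc` is a list of (part, sentence) pairs, e.g. ("title", "web clustering").
    """
    tf = Counter()
    for part, sentence in doc:
        for phrase in phrases(sentence):
            tf[phrase] += STRUCTURE_WEIGHTS[part]
    return tf


def tfidf_vectors(docs):
    """Map each document to a sparse {phrase: weight} vector.

    Assumed variant: (1 + log tf) * log(1 + N / df). All structure weights
    are >= 1, so log tf is never negative.
    """
    tfs = [weighted_tf(doc) for doc in docs]
    df = Counter(phrase for tf in tfs for phrase in tf)
    n = len(docs)
    return [
        {p: (1 + math.log(w)) * math.log(1 + n / df[p]) for p, w in tf.items()}
        for tf in tfs
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v[p] for p, w in u.items() if p in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    docs = [
        [("title", "web document clustering"), ("body", "suffix tree clustering of web documents")],
        [("title", "clustering web documents"), ("body", "a suffix tree model")],
        [("body", "cooking pasta at home")],
    ]
    vectors = tfidf_vectors(docs)
    print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

The resulting pairwise similarities could then be passed to any group-average agglomerative clustering routine. A phrase in a title contributes more to the vector than the same phrase in body text, which is the effect the structure weights are meant to capture.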
Pages: 1521-1528
Page count: 7