A novel weighted phrase-based similarity for Web documents clustering

Cited by: 2
Authors
Yang R. [1]
Zhu Q. [1]
Xia Y. [1]
Affiliations
[1] College of Computer Science, Chongqing University, Chongqing
Keywords
Document structure; Phrase-based similarity; Suffix tree; Web document clustering; Weight computing
DOI
10.4304/jsw.6.8.1521-1528
Abstract
Phrases have been considered more informative feature terms for improving the effectiveness of document clustering. In this paper, a weighted phrase-based document similarity is proposed to compute the pairwise similarities of documents based on the Weighted Suffix Tree Document (WSTD) model. The similarity is applied to the Group-average Hierarchical Agglomerative Clustering (GHAC) algorithm to develop a new Web document clustering approach. According to the structure of the Web documents, different document parts are assigned different levels of significance as structure weights, which are stored in the nodes of a weighted suffix tree constructed from sentences instead of whole documents. By mapping each node and its weights in the WSTD model to a unique feature term in the Vector Space Document (VSD) model, the new similarity naturally inherits the TF-IDF term weighting scheme when computing document similarity with weighted phrases. The evaluation experiments indicate that the new clustering approach is very effective in clustering Web documents. Its quality greatly surpasses that of the traditional phrase-based approach, in which Web document structure is ignored. In conclusion, the weighted phrase-based similarity works much better than the ordinary phrase-based similarity. © 2011 ACADEMY PUBLISHER.
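The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions: contiguous word n-grams stand in for the phrases that suffix tree nodes would represent, the values in `STRUCTURE_WEIGHTS` are hypothetical, the TF-IDF form (weighted frequency times `log(N/df)`) is one common choice, and `ghac` merges clusters by the average pairwise similarity between their members. All function names are illustrative.

```python
import math
from collections import defaultdict

# Hypothetical structure weights; the abstract does not state actual values.
STRUCTURE_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}

def phrases(sentence, max_len=3):
    """Yield contiguous word n-grams (a stand-in for suffix tree nodes)."""
    words = sentence.lower().split()
    for i in range(len(words)):
        for j in range(i + 1, min(i + max_len, len(words)) + 1):
            yield tuple(words[i:j])

def weighted_tf(doc):
    """doc is a list of (part, sentence); add the part's weight per phrase hit."""
    tf = defaultdict(float)
    for part, sentence in doc:
        for p in phrases(sentence):
            tf[p] += STRUCTURE_WEIGHTS[part]
    return tf

def tfidf_vectors(docs):
    """Map each phrase to one vector dimension with an assumed TF-IDF weight."""
    tfs = [weighted_tf(d) for d in docs]
    df = defaultdict(int)
    for tf in tfs:
        for p in tf:
            df[p] += 1
    n = len(docs)
    return [{p: w * math.log(n / df[p]) for p, w in tf.items()} for tf in tfs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(p, 0.0) for p, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def ghac(vectors, threshold):
    """Merge the most similar pair of clusters until similarity < threshold."""
    clusters = [[i] for i in range(len(vectors))]

    def avg_sim(a, b):
        total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        best, (x, y) = max(
            (avg_sim(a, b), (x, y))
            for x, a in enumerate(clusters)
            for y, b in enumerate(clusters)
            if x < y
        )
        if best < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Toy documents, invented for illustration only.
docs = [
    [("title", "suffix tree clustering"),
     ("body", "web documents are clustered with a suffix tree")],
    [("title", "suffix tree clustering"),
     ("body", "phrases improve web document clustering")],
    [("title", "cooking pasta"),
     ("body", "boil water and add salt")],
]
vectors = tfidf_vectors(docs)
clusters = ghac(vectors, threshold=0.1)
```

In this toy setup a phrase that occurs in a title contributes three times as much as the same phrase in body text, which is the kind of effect the structure weights are meant to have. A real suffix tree would replace the fixed `max_len` n-gram enumeration.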
Pages: 1521-1528
Number of pages: 7
Related papers
50 in total
  • [21] Integrating Phrase Inseparability in Phrase-Based Model
    Shi, Lixin
    Nie, Jian-Yun
    PROCEEDINGS 32ND ANNUAL INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2009, : 708 - 709
  • [22] Discriminative Phrase-based Lexicalized Reordering Models using Weighted Reordering Graphs
    L2F Spoken Systems Lab, INESC-ID, Lisboa, Portugal
    Unknown
    PA, United States
    IJCNLP - Proc. Int. Jt. Conf. Nat. Lang. Process., (47-55):
  • [23] Constrained phrase-based translation using weighted finite-state transducers
    Zhou, BW
    Chen, SF
    Gao, YQ
    2005 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS 1-5: SPEECH PROCESSING, 2005, : 1017 - 1020
  • [24] Statistical phrase-based speech translation
    Mathias, Lambert
    Byrne, William
    2006 IEEE International Conference on Acoustics, Speech and Signal Processing, Vols 1-13, 2006, : 561 - 564
  • [25] Improved techniques for phrase-based translation
    Ruiz Costa-Jussa, Marta
    Fonollosa, Jose A. R.
    PROCESAMIENTO DEL LENGUAJE NATURAL, 2005, (35): : 351 - 356
  • [26] Deriving phrase-based language models
    Heeman, PA
    Damnati, G
    1997 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING, PROCEEDINGS, 1997, : 41 - 48
  • [27] Phrase-based statistical machine translation
    Zens, R
    Och, FJ
    Ney, H
    KI2002: ADVANCES IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2002, 2479 : 18 - 32
  • [28] Syntactically lexicalized phrase-based SMT
    Hassan, Hany
    Sima'an, Khalil
    Way, Andy
    IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2008, 16 (07): : 1260 - 1273
  • [29] Semantic based clustering of web documents
    Lin, TY
    Chiang, IJ
    2005 IEEE INTERNATIONAL CONFERENCE ON GRANULAR COMPUTING, VOLS 1 AND 2, 2005, : 189 - 192
  • [30] Clustering template based web documents
    Gottron, Thomas
    ADVANCES IN INFORMATION RETRIEVAL, 2008, 4956 : 40 - 51