A novel weighted phrase-based similarity for Web documents clustering

Cited by: 2
Authors
Yang R. [1]
Zhu Q. [1]
Xia Y. [1]
Affiliations
[1] College of Computer Science, Chongqing University, Chongqing
Keywords
Document structure; Phrase-based similarity; Suffix tree; Web document clustering; Weight computing
DOI
10.4304/jsw.6.8.1521-1528
Abstract
Phrases have been considered more informative feature terms than single words for improving the effectiveness of document clustering. This paper proposes a weighted phrase-based document similarity that computes the pairwise similarities of documents based on the Weighted Suffix Tree Document (WSTD) model. The similarity is applied to the Group-average Hierarchical Agglomerative Clustering (GHAC) algorithm to develop a new Web document clustering approach. According to the structure of each Web document, different document parts are assigned different levels of significance. These structure weights are stored in the nodes of the weighted suffix tree, which is constructed from sentences rather than whole documents. By mapping each node and its weights in the WSTD model to a unique feature term in the Vector Space Document (VSD) model, the new similarity naturally inherits the TF-IDF term weighting scheme when computing document similarity with weighted phrases. The evaluation experiments indicate that the new clustering approach is very effective at clustering Web documents: its quality greatly surpasses that of the traditional phrase-based approach, which ignores the structure of Web documents. In conclusion, the weighted phrase-based similarity works much better than the ordinary phrase-based similarity. © 2011 ACADEMY PUBLISHER.
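The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives neither the structure weight values nor the exact TF-IDF formula, so `STRUCTURE_WEIGHTS`, `MAX_PHRASE_LEN`, the part names (`title`, `heading`, `body`), and the `(1 + log tf) * log(1 + N / df)` weighting are all assumptions made for illustration. Contiguous word n-grams within each sentence stand in for suffix tree nodes, which is a simplification of the WSTD model.

```python
import math
from collections import Counter

# Hypothetical structure weights; the abstract does not give the actual values.
# Weights are assumed to be >= 1 so that 1 + log(tf) stays positive.
STRUCTURE_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}
MAX_PHRASE_LEN = 3  # assumed cap on phrase length, in words


def weighted_phrase_counts(doc):
    """Sum structure weights over every phrase occurrence in one document.

    `doc` is a list of (part, sentence) pairs. Phrases are contiguous word
    n-grams within a sentence, used here in place of suffix tree nodes.
    """
    counts = Counter()
    for part, sentence in doc:
        words = sentence.lower().split()
        weight = STRUCTURE_WEIGHTS[part]
        for start in range(len(words)):
            stop = min(start + MAX_PHRASE_LEN, len(words))
            for end in range(start + 1, stop + 1):
                counts[tuple(words[start:end])] += weight
    return counts


def tfidf_vectors(docs):
    """Map each document to a sparse {phrase: weight} TF-IDF vector."""
    counts = [weighted_phrase_counts(doc) for doc in docs]
    # Document frequency: number of documents containing each phrase.
    df = Counter(phrase for c in counts for phrase in c)
    n = len(docs)
    return [
        {p: (1.0 + math.log(tf)) * math.log(1.0 + n / df[p])
         for p, tf in c.items()}
        for c in counts
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[p] for p, w in u.items() if p in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus: two documents about suffix tree clustering, one unrelated.
docs = [
    [("title", "suffix tree clustering"),
     ("body", "we cluster web documents with a suffix tree")],
    [("title", "web document clustering"),
     ("body", "a suffix tree stores shared phrases")],
    [("title", "cooking pasta"),
     ("body", "boil water and add salt")],
]
vectors = tfidf_vectors(docs)
sim_related = cosine(vectors[0], vectors[1])
sim_unrelated = cosine(vectors[0], vectors[2])
```

In this toy corpus the first two documents share phrases such as "suffix tree", so `sim_related` is positive, while the third shares no words with the first and `sim_unrelated` is zero. A matrix of such pairwise similarities is the kind of input a group-average agglomerative clustering step would consume; that step is not shown here.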
Pages: 1521-1528
Page count: 7