A novel weighted phrase-based similarity for Web documents clustering

Cited by: 2
Authors
Yang R. [1]
Zhu Q. [1]
Xia Y. [1]
Affiliation
[1] College of Computer Science, Chongqing University, Chongqing
Keywords
Document structure; Phrase-based similarity; Suffix tree; Web document clustering; Weight computing
DOI
10.4304/jsw.6.8.1521-1528
Abstract
Phrases are considered more informative feature terms for improving the effectiveness of document clustering. In this paper, a weighted phrase-based document similarity is proposed that computes the pairwise similarities of documents based on the Weighted Suffix Tree Document (WSTD) model. This similarity is applied to the Group-average Hierarchical Agglomerative Clustering (GHAC) algorithm to develop a new Web document clustering approach. According to the structure of a Web document, its different parts are assigned different levels of significance. These levels are stored as structure weights in the nodes of the weighted suffix tree, which is constructed from sentences rather than whole documents. By mapping each node and its weights in the WSTD model to a unique feature term in the Vector Space Document (VSD) model, the new similarity naturally inherits the TF-IDF term weighting scheme when computing document similarity with weighted phrases. The evaluation experiments indicate that the new clustering approach is effective for clustering Web documents, and that its quality greatly surpasses that of the traditional phrase-based approach, which ignores the structure of Web documents. In conclusion, the weighted phrase-based similarity works much better than the ordinary phrase-based similarity. © 2011 ACADEMY PUBLISHER.
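The abstract gives no formulas, so the following is only a rough sketch of the general idea, not the authors' implementation. It assumes hypothetical structure weights (`STRUCTURE_WEIGHTS`), a hypothetical cap on phrase length (`MAX_PHRASE_LEN`), and one common TF-IDF variant. It enumerates word n-grams per sentence as a stand-in for walking the nodes of a weighted suffix tree. Each phrase becomes one feature term, its frequency is scaled by the weight of the document part it occurs in, and documents are compared by cosine similarity.

```python
import math
from collections import Counter

# Hypothetical structure weights; the abstract does not state the actual values.
STRUCTURE_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}
MAX_PHRASE_LEN = 3  # assumed cap on phrase length, chosen only for brevity


def weighted_phrase_counts(doc):
    """doc is a list of (part, sentence) pairs; returns phrase -> weighted frequency."""
    counts = Counter()
    for part, sentence in doc:
        words = sentence.lower().split()
        weight = STRUCTURE_WEIGHTS[part]
        # Enumerating contiguous word n-grams within a sentence stands in for
        # walking the nodes of a sentence-level suffix tree.
        for start in range(len(words)):
            stop = min(start + MAX_PHRASE_LEN, len(words))
            for end in range(start + 1, stop + 1):
                counts[" ".join(words[start:end])] += weight
    return counts


def tfidf_vectors(docs):
    """Map each document to a sparse vector: phrase -> TF-IDF weight."""
    counts = [weighted_phrase_counts(d) for d in docs]
    df = Counter()
    for c in counts:
        df.update(c.keys())  # document frequency of each phrase
    n = len(docs)
    # Assumed weighting: (1 + log tf) * log(1 + N / df).
    return [
        {p: (1 + math.log(tf)) * math.log(1 + n / df[p]) for p, tf in c.items()}
        for c in counts
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[p] for p, w in u.items() if p in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy documents, invented purely for illustration.
docs = [
    [("title", "suffix tree clustering"),
     ("body", "web documents are clustered with a suffix tree")],
    [("title", "suffix tree model"),
     ("body", "phrases are shared between documents")],
    [("title", "cooking pasta"),
     ("body", "boil water and add salt")],
]
vectors = tfidf_vectors(docs)
sim01 = cosine(vectors[0], vectors[1])  # documents sharing weighted phrases
sim02 = cosine(vectors[0], vectors[2])  # documents sharing no phrases
```

In this toy setup a phrase in a title contributes three times as much to the term frequency as the same phrase in body text, which is the kind of effect the structure weights are meant to have. The resulting pairwise similarities could then be passed to a group-average agglomerative clustering routine.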
Pages: 1521-1528
Number of pages: 7