A New Retrieval Model Based on TextTiling for Document Similarity Search

Authors
Xiao-Jun Wan
Yu-Xin Peng
Affiliation
[1] Peking University, National Key Laboratory of Text Processing Technology, Institute of Computer Science and Technology
Keywords
document similarity search; retrieval model; similarity measure; TextTiling; optimal matching
DOI
Not available
Abstract
Document similarity search finds documents similar to a given query document and returns them to the user as a ranked list. It is widely used in text and web systems such as digital libraries and search engines. Traditional retrieval models, including Okapi's BM25 model and SMART's vector space model with length normalization, can handle this problem to some extent by treating the query document as a long query. In practice, the Cosine measure is considered the best model for document similarity search because it measures the similarity between two documents well. This paper compares the performance of these models experimentally. Because the Cosine measure cannot reflect the structural similarity between documents, the paper proposes a new retrieval model based on TextTiling that takes the subtopic structure of documents into account. The model first splits each document into text segments with TextTiling, then calculates the similarity of each pair of text segments drawn from the two documents, and finally combines these pairwise similarities into an overall document similarity using an optimal matching method. The experimental results show that: 1) the popular retrieval models (Okapi's BM25 model and SMART's vector space model with length normalization) do not perform well for document similarity search; 2) the proposed TextTiling-based model is effective and outperforms the other models, including the Cosine measure; 3) the methods chosen for the three components of the proposed model are validated as appropriate.
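The three-step pipeline described in the abstract (segment, compare segment pairs, combine by optimal matching) can be illustrated with a minimal sketch. This is not the authors' implementation, and several parts are assumptions made only for illustration: `split_segments` is a stand-in that splits on blank lines rather than running TextTiling, segment similarity is a raw term-frequency cosine, the optimal matching is found by brute force (practical only for a handful of segments), and the final score is normalized by the smaller segment count. All function names are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import permutations


def split_segments(text):
    """Stand-in for TextTiling (assumption): blank-line-separated paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def term_vector(segment):
    """Raw term-frequency vector over lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", segment.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def optimal_matching_score(sim):
    """Maximum total weight of a one-to-one matching, by brute force.

    sim is a non-empty rows x cols matrix of pairwise similarities.
    """
    rows, cols = len(sim), len(sim[0])
    if rows > cols:
        # Transpose so that every row can be assigned a distinct column.
        sim = [list(col) for col in zip(*sim)]
        rows, cols = cols, rows
    best = 0.0
    for perm in permutations(range(cols), rows):
        best = max(best, sum(sim[i][j] for i, j in enumerate(perm)))
    return best


def document_similarity(doc_a, doc_b):
    """Segment both documents, score all segment pairs, combine by matching."""
    segs_a = [term_vector(s) for s in split_segments(doc_a)]
    segs_b = [term_vector(s) for s in split_segments(doc_b)]
    if not segs_a or not segs_b:
        return 0.0
    sim = [[cosine(a, b) for b in segs_b] for a in segs_a]
    # Normalizing by the smaller segment count is an assumption of this sketch.
    return optimal_matching_score(sim) / min(len(segs_a), len(segs_b))
```

Because each segment is matched to its best available counterpart regardless of position, two documents that cover the same subtopics in a different order still score highly under this sketch. For realistic segment counts, a polynomial-time assignment algorithm such as the Hungarian method would replace the brute-force search.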
Pages: 552 - 558
Number of pages: 6
Related papers
50 in total
  • [31] Content based image retrieval based on a nonlinear similarity model
    Cha, Guang-Ho
    COMPUTATIONAL SCIENCE AND ITS APPLICATIONS - ICCSA 2006, PT 1, 2006, 3980 : 344 - 353
  • [32] A Similarity Search Method for Encrypted Cloud Document
    Fu, Zhangjie
    Shu, Jiangang
    Wang, Jin
    Sun, Xingming
    2014 TENTH INTERNATIONAL CONFERENCE ON INTELLIGENT INFORMATION HIDING AND MULTIMEDIA SIGNAL PROCESSING (IIH-MSP 2014), 2014, : 791 - 794
  • [33] Comparison of two "document similarity search engines"
    Poinçot, P
    Lesteven, S
    Murtagh, F
    LIBRARY AND INFORMATION SERVICES IN ASTRONOMY III (LISA III), 1998, 153 : 85 - 92
  • [34] A Document Retrieval Model Based on Digital Signal Filtering
    Costa, Alberto
    Di Buccio, Emanuele
    Melucci, Massimo
    ACM TRANSACTIONS ON INFORMATION SYSTEMS, 2015, 34 (01)
  • [35] New similarity search based glioma grading
    Katrin Haegler
    Martin Wiesmann
    Christian Böhm
    Jessica Freiherr
    Oliver Schnell
    Hartmut Brückmann
    Jörg-Christian Tonn
    Jennifer Linn
    Neuroradiology, 2012, 54 : 829 - 837
  • [36] New similarity search based glioma grading
    Haegler, Katrin
    Wiesmann, Martin
    Boehm, Christian
    Freiherr, Jessica
    Schnell, Oliver
    Brueckmann, Hartmut
    Tonn, Joerg-Christian
    Linn, Jennifer
    NEURORADIOLOGY, 2012, 54 (08) : 829 - 837
  • [37] Markov network retrieval model based on document cliques
    Wang, Mingwen, 1600, Science Press (51):
  • [38] MAC/FAC - A MODEL FOR SIMILARITY-BASED RETRIEVAL
    FORBUS, KD
    GENTNER, D
    LAW, K
    COGNITIVE SCIENCE, 1995, 19 (02) : 141 - 205
  • [39] Genetic algorithm based model for effective document retrieval
    Department of Computer Science, Jamia Hamdard, Hamdard Nagar, New Delhi 110 062, India
    Unknown
    Lect. Notes Electr. Eng., (191-201):
  • [40] A model based on Influence Diagrams for structured document retrieval
    Xu, JM
    Zhao, S
    Chai, BF
    PROCEEDINGS OF 2005 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-9, 2005, : 3225 - 3231