A New Retrieval Model Based on TextTiling for Document Similarity Search

Authors
Xiao-Jun Wan
Yu-Xin Peng
Affiliation
[1] Peking University, National Key Laboratory of Text Processing Technology, Institute of Computer Science and Technology
Keywords
document similarity search; retrieval model; similarity measure; TextTiling; optimal matching
DOI
Not available
Abstract
Document similarity search finds documents similar to a given query document and returns them to the user as a ranked list. It is widely used in text and web systems such as digital libraries and search engines. Traditional retrieval models, including the Okapi BM25 model and the SMART vector space model with length normalization, can handle this problem to some extent by treating the query document as a long query. In practice, the Cosine measure is considered the best model for document similarity search because it measures the similarity between two documents well. This paper compares the quantitative performance of these models experimentally. Because the Cosine measure cannot reflect the structural similarity between documents, the paper also proposes a new retrieval model based on TextTiling that takes the subtopic structure of documents into account. The model first splits each document into text segments with TextTiling, then calculates the similarity of each pair of text segments drawn from the two documents, and finally combines these pairwise similarities with an optimal matching method to obtain the overall similarity between the documents. The experimental results show that: 1) the popular retrieval models (the Okapi BM25 model and the SMART vector space model with length normalization) do not perform well for document similarity search; 2) the proposed TextTiling-based model is effective and outperforms the other models, including the Cosine measure; 3) the methods chosen for the three components of the proposed model are appropriate.
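The three steps named in the abstract (segmentation, pairwise segment similarity, optimal matching) can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: blank-line paragraph splitting stands in for TextTiling, segment similarity is the cosine of raw term-frequency vectors, the optimal matching is found by brute force (practical only for a handful of segments), and the matched total is normalized by the smaller segment count. The function names `split_segments` and `optimal_matching_similarity` are hypothetical.

```python
from collections import Counter
from itertools import permutations
import math
import re


def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def split_segments(doc):
    """ASSUMPTION: blank-line paragraphs stand in for TextTiling segments."""
    parts = [p for p in re.split(r"\n\s*\n", doc) if p.strip()]
    return [Counter(tokenize(p)) for p in parts]


def optimal_matching_similarity(doc_a, doc_b):
    """Best one-to-one pairing of segments, averaged over the matched pairs.

    ASSUMPTION: normalizing by the smaller segment count is an illustrative
    choice; the abstract does not say how the matched total is normalized.
    """
    segs_a, segs_b = split_segments(doc_a), split_segments(doc_b)
    if not segs_a or not segs_b:
        return 0.0
    if len(segs_a) > len(segs_b):
        segs_a, segs_b = segs_b, segs_a  # make segs_a the shorter list
    sim = [[cosine(x, y) for y in segs_b] for x in segs_a]
    # Brute force: try every injective assignment of segs_a onto segs_b.
    best = max(
        sum(sim[i][j] for i, j in enumerate(cols))
        for cols in permutations(range(len(segs_b)), len(segs_a))
    )
    return best / len(segs_a)
```

For example, `optimal_matching_similarity("cats chase mice\n\ndogs chase cats", "dogs chase cats\n\ncats chase mice")` pairs each segment with its identical counterpart regardless of order. For longer documents, the brute-force search would normally be replaced by a polynomial-time assignment solver such as the Hungarian algorithm.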
Pages: 552-558
Page count: 6