SimDoc: Topic Sequence Alignment based Document Similarity Framework

Cited by: 0
Authors
Maheshwari, Gaurav [1]
Trivedi, Priyansh [1]
Sahijwani, Harshita [2]
Jha, Kunal [1]
Dasgupta, Sourish [3]
Lehmann, Jens [1]
Affiliations
[1] Univ Bonn, Bonn, Germany
[2] Emory Univ, Atlanta, GA 30322 USA
[3] Rygbee Inc, Denver, CO USA
Keywords
Similarity Measures; Document Topic Models; Lexical Semantics
DOI
10.1145/3148011.3148016
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Document similarity is the problem of estimating the degree to which a given pair of documents has similar semantic content. An accurate document similarity measure can improve several enterprise-relevant tasks such as document clustering, text mining, and question answering. In this paper, we show that the thematic flow of documents, which is often disregarded by bag-of-words techniques, is pivotal in estimating their similarity. To this end, we propose a novel semantic document similarity framework, called SimDoc. We model documents as topic sequences, where topics represent latent generative clusters of related words. Then, we use a sequence alignment algorithm to estimate their semantic similarity. We further conceptualize a novel mechanism to compute topic-topic similarity to fine-tune our system. In our experiments, we show that SimDoc outperforms many contemporary bag-of-words techniques in accurately computing document similarity, and on practical applications such as document clustering.
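The idea in the abstract (documents as topic sequences, compared by sequence alignment with a topic-topic similarity score) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which alignment algorithm or topic-topic measure SimDoc uses, so the sketch assumes a Needleman-Wunsch style global alignment, cosine similarity between toy topic-word vectors, and a hypothetical gap penalty of -0.5. The names `TOPICS`, `topic_sim`, `align_score`, and `doc_similarity` are illustrative only.

```python
import math

# Hypothetical toy topics: each topic id maps to a distribution over three words.
TOPICS = {
    0: [0.7, 0.2, 0.1],
    1: [0.6, 0.3, 0.1],
    2: [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def topic_sim(a, b):
    """Assumed topic-topic score: cosine rescaled from [0, 1] to [-1, 1],
    so that unrelated topics are penalised when aligned."""
    return 2.0 * cosine(TOPICS[a], TOPICS[b]) - 1.0

def align_score(seq_a, seq_b, sim, gap=-0.5):
    """Global alignment score (Needleman-Wunsch recurrence) of two topic sequences."""
    n, m = len(seq_a), len(seq_b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim(seq_a[i - 1], seq_b[j - 1]),  # align two topics
                score[i - 1][j] + gap,                                   # gap in seq_b
                score[i][j - 1] + gap,                                   # gap in seq_a
            )
    return score[n][m]

def doc_similarity(seq_a, seq_b, sim=topic_sim, gap=-0.5):
    """Alignment score normalised by the longer sequence and floored at 0."""
    if not seq_a or not seq_b:
        return 0.0
    raw = align_score(seq_a, seq_b, sim, gap)
    return max(0.0, raw / max(len(seq_a), len(seq_b)))

same_flow = doc_similarity([0, 1, 2], [0, 1, 2])
reversed_flow = doc_similarity([0, 1, 2], [2, 1, 0])
```

In this toy setup the two documents in the second comparison contain exactly the same topics, so a bag-of-topics comparison would treat them as identical, while the alignment score is lower because the thematic order differs.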
Pages: 8