Searching Corrupted Document Collections

Cited by: 0
Authors
Soo, Jason [1]
Frieder, Ophir [1]
Affiliation
[1] Georgetown Univ, Informat Retrieval Lab, Washington, DC 20007 USA
Keywords
OCR; known item retrieval; retrieval
DOI
10.1109/DAS.2016.28
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Historical documents are typically digitized using optical character recognition (OCR). While effective, OCR output is not always accurate and depends heavily on the quality of the input. Consequently, text extracted from degraded documents is often corrupted. Our focus is on finding flexible, reliable methods to correct for such degradation in the face of limited resources. We extend our substring and context fusion based retrieval system, known as Segments, to consider metadata. By extracting topics from documents, and supplementing and weighting our lexicon with co-occurring terms found in documents with those topics, we achieve a statistically significant improvement over the state of the art in all but one test configuration. Our mean reciprocal rank, measured on two free, publicly available, independently judged datasets, is 0.7657 and 0.5382, respectively.
Pages: 440-445
Page count: 6