Searching Corrupted Document Collections

Cited by: 0
Authors
Soo, Jason [1]
Frieder, Ophir [1]
Affiliations
[1] Georgetown Univ, Informat Retrieval Lab, Washington, DC 20007 USA
Keywords
OCR; known item retrieval; RETRIEVAL;
DOI
10.1109/DAS.2016.28
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Historical documents are typically digitized using optical character recognition (OCR). While effective, OCR output is not always accurate and depends heavily on the quality of the input. Consequently, text extracted from degraded documents is often corrupted. Our focus is on finding flexible, reliable methods to correct for such degradation when resources are limited. We extend Segments, our retrieval system based on substring and context fusion, to consider metadata. By extracting topics from documents, and by supplementing and weighting our lexicon with co-occurring terms found in documents with those topics, we achieve a statistically significant improvement over the state of the art in all but one test configuration. Our mean reciprocal rank, measured on two free, publicly available, independently judged datasets, is 0.7657 and 0.5382.
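The abstract reports results as mean reciprocal rank (MRR), the usual measure for known-item retrieval: for each query, take the reciprocal of the rank at which the single sought document appears (0 if it is not returned), then average over all queries. The sketch below illustrates that definition only. It is not the authors' evaluation code, and the function names and sample document identifiers are hypothetical.

```python
from typing import Iterable, Sequence, Tuple


def reciprocal_rank(ranked_ids: Sequence[str], target_id: str) -> float:
    """Return 1/rank of the target in the ranked list, or 0.0 if it is absent."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == target_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], str]]) -> float:
    """Average the reciprocal rank over (ranked_ids, target_id) pairs."""
    scores = [reciprocal_rank(ranked, target) for ranked, target in runs]
    if not scores:
        raise ValueError("at least one query is required")
    return sum(scores) / len(scores)


# Hypothetical example: three known-item queries against a small collection.
example_runs = [
    (["d1", "d2", "d3"], "d1"),  # target at rank 1 -> 1.0
    (["d4", "d5", "d6"], "d5"),  # target at rank 2 -> 0.5
    (["d7", "d8", "d9"], "d0"),  # target not returned -> 0.0
]
example_mrr = mean_reciprocal_rank(example_runs)
```

Under this definition, a score of 0.7657 means the sought document tends to appear at or very near the top of the ranking, although the average does not by itself say how the ranks are distributed across queries.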
Pages: 440-445
Page count: 6