Searching Corrupted Document Collections

Authors
Soo, Jason [1]
Frieder, Ophir [1]
Affiliation
[1] Georgetown Univ, Informat Retrieval Lab, Washington, DC 20007 USA
Keywords
OCR; known item retrieval; RETRIEVAL
DOI
10.1109/DAS.2016.28
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Historical documents are typically digitized using optical character recognition (OCR). While effective, OCR is not always accurate, and its results depend heavily on the input. Consequently, the text of degraded documents is often corrupted. Our focus is on finding flexible, reliable methods to correct for such degradation in the face of limited resources. We extend our substring and context fusion based retrieval system, known as Segments, to consider metadata. By extracting topics from documents, and by supplementing and weighting our lexicon with co-occurring terms found in documents with those topics, we achieve a statistically significant improvement over the state of the art in all but one test configuration. Our mean reciprocal rank, measured on two free, publicly available, independently judged datasets, is 0.7657 and 0.5382.
Pages: 440-445
Number of pages: 6
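
The abstract reports results as mean reciprocal rank (MRR), the usual measure for known item retrieval: each query has one target document, and the score for a query is 1 divided by the rank at which that document is returned (0 if it is not returned). The sketch below shows only this standard definition. The function names and the toy document IDs are hypothetical; this is not the authors' evaluation code, and the numbers are not taken from the paper.

```python
from typing import Hashable, Sequence, Tuple


def reciprocal_rank(ranking: Sequence[Hashable], target: Hashable) -> float:
    """Return 1/rank of the known item, or 0.0 if it was not retrieved."""
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id == target:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    runs: Sequence[Tuple[Sequence[Hashable], Hashable]]
) -> float:
    """Average the reciprocal rank over (ranking, target) pairs."""
    if not runs:
        raise ValueError("at least one query is required")
    return sum(reciprocal_rank(r, t) for r, t in runs) / len(runs)


# Toy data (hypothetical): three queries, each with one known item.
runs = [
    (["d3", "d1", "d2"], "d1"),  # target at rank 2 -> 0.5
    (["d5", "d4"], "d5"),        # target at rank 1 -> 1.0
    (["d7", "d8"], "d9"),        # target not retrieved -> 0.0
]
mrr = mean_reciprocal_rank(runs)
print(f"MRR = {mrr:.4f}")  # MRR = 0.5000
```

Under this definition, an MRR of 0.7657 means the known item is on average found close to the top of the ranking, though the mean alone does not say how the ranks are distributed across queries.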