Searching Corrupted Document Collections

Authors
Soo, Jason [1]
Frieder, Ophir [1]
Affiliation
[1] Georgetown Univ, Informat Retrieval Lab, Washington, DC 20007 USA
Keywords
OCR; known item retrieval; retrieval
DOI
10.1109/DAS.2016.28
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Historical documents are typically digitized using optical character recognition (OCR). While effective, OCR output is not always accurate and depends heavily on the quality of the input. Consequently, text extracted from degraded documents is often corrupted. Our focus is finding flexible, reliable methods to correct for such degradation in the face of limited resources. We extend our substring- and context-fusion-based retrieval system, known as Segments, to consider metadata. By extracting topics from documents, and by supplementing and weighting our lexicon with co-occurring terms found in documents with those topics, we achieve a statistically significant improvement over the state of the art in all but one test configuration. Our mean reciprocal rank, measured on two free, publicly available, independently judged datasets, is 0.7657 and 0.5382.
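For readers unfamiliar with the metric reported in the abstract, the following is a minimal sketch of how mean reciprocal rank is conventionally computed for known-item retrieval. It is not the authors' evaluation code. The function names, document identifiers, and example rankings are hypothetical and chosen only for illustration.

```python
from typing import Hashable, Sequence, Tuple


def reciprocal_rank(ranked: Sequence[Hashable], target: Hashable) -> float:
    """Return 1/rank of the known item, or 0.0 if it was not retrieved."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id == target:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    runs: Sequence[Tuple[Sequence[Hashable], Hashable]]
) -> float:
    """Average the reciprocal rank over all (ranking, known item) pairs."""
    if not runs:
        raise ValueError("at least one query is required")
    return sum(reciprocal_rank(r, t) for r, t in runs) / len(runs)


# Hypothetical queries: each pairs a ranked result list with its known item.
example_runs = [
    (["d3", "d1", "d7"], "d3"),  # found at rank 1 -> 1.0
    (["d2", "d9", "d4"], "d9"),  # found at rank 2 -> 0.5
    (["d5", "d6", "d8"], "d0"),  # not retrieved   -> 0.0
]
example_mrr = mean_reciprocal_rank(example_runs)  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```

Under this definition, a score such as 0.7657 means the known item tends to appear at or near the top of the ranking, although the average alone does not show how ranks are distributed across queries.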
Pages: 440-445
Page count: 6