Searching Corrupted Document Collections

Cited by: 0

Authors:

Soo, Jason [1]

Frieder, Ophir [1]

Affiliation:

[1] Georgetown University, Information Retrieval Lab, Washington, DC 20007, USA

Source:

Proceedings of the 12th IAPR Workshop on Document Analysis Systems (DAS 2016) | 2016

Keywords:

OCR; known item retrieval; RETRIEVAL;

DOI:

10.1109/DAS.2016.28

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Historical documents are typically digitized using optical character recognition (OCR). While OCR is effective, its results are not always accurate and depend heavily on the input. Consequently, text recognized from degraded documents is often corrupted. Our focus is finding flexible, reliable methods to correct for such degradation in the face of limited resources. We extend our substring- and context-fusion-based retrieval system, known as Segments, to consider metadata. By extracting topics from documents, and by supplementing and weighting our lexicon with co-occurring terms found in documents with those topics, we achieve a statistically significant improvement over the state of the art in all but one test configuration. Our mean reciprocal rank, measured on two free, publicly available, independently judged datasets, is 0.7657 and 0.5382.
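
For readers unfamiliar with the metric reported above, mean reciprocal rank (MRR) in known-item retrieval is commonly computed as the average, over all queries, of 1 divided by the rank at which the single target document appears (0 when it is not retrieved). The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function names and the toy document IDs are illustrative assumptions.

```python
from typing import Sequence, Tuple


def reciprocal_rank(ranked_ids: Sequence[str], target_id: str) -> float:
    """Return 1/rank of target_id in ranked_ids (ranks start at 1), or 0.0 if absent."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == target_id:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Tuple[Sequence[str], str]]) -> float:
    """Average the reciprocal rank over (ranked result list, known target) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, target) for ranked, target in runs) / len(runs)


# Hypothetical toy data: three queries, each with one known target document.
example_runs = [
    (["d3", "d1", "d2"], "d1"),  # target at rank 2 -> 0.5
    (["d5", "d4"], "d5"),        # target at rank 1 -> 1.0
    (["d7"], "d9"),              # target not retrieved -> 0.0
]
mrr = mean_reciprocal_rank(example_runs)
```

Under this definition, a score of 1.0 would mean that the target document is ranked first for every query.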


Pages: 440-445

Page count: 6
