Digitization of Text Documents Using PDF/A

Cited by: 3

Authors:

Han, Yan [1]

Wan, Xueheng [2]

Affiliations:

[1] Univ Arizona Lib, Tucson, AZ 85721 USA

[2] Univ Arizona, Dept Comp Sci, Tucson, AZ 85721 USA

Source:

INFORMATION TECHNOLOGY AND LIBRARIES | 2018 / Vol. 37 / No. 01

DOI:

10.6017/ITAL.V37I1.9878

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Subject classification code:

0812

Abstract:

The purpose of this article is to demonstrate a practical use case of PDF/A for digitization of text documents following FADGI's recommendation of using PDF/A as a preferred digitization file format. The authors demonstrate how to convert and combine TIFFs with associated metadata into a single PDF/A-2b file for a document. Using real-life examples and open source software, the authors show readers how to convert TIFF images, extract associated metadata and International Color Consortium (ICC) profiles, and validate against the newly released PDF/A validator. The generated PDF/A file is a self-contained and self-described container that accommodates all the data from digitization of textual materials, including page-level metadata and ICC profiles. Providing theoretical analysis and empirical examples, the authors show that PDF/A has many advantages over the traditionally preferred file format, TIFF/JPEG2000, for digitization of text documents.

Pages: 52-64

Page count: 13

50 items in total

[21] A Wrapper Generation System for PDF Documents
Fazzinga, Bettina
Flesca, Sergio
Tagarelli, Andrea
Garruzzo, Salvatore
Masciari, Elio
APPLIED COMPUTING 2008, VOLS 1-3, 2008, : 442 - +
[22] Recognition and classification of figures in PDF documents
Shao, Mingyan
Futrelle, Robert P.
GRAPHICS RECOGNITION: TEN YEARS REVIEW AND FUTURE PERSPECTIVES, 2006, 3926 : 231 - 242
[23] Layout and content extraction for PDF documents
Chao, H
Fan, J
DOCUMENT ANALYSIS SYSTEMS VI, PROCEEDINGS, 2004, 3163 : 213 - 224
[24] ONLINE BINARY VISUALIZATION FOR PDF DOCUMENTS
Mavric, Soon Heng Tan
Yeo, Chai Kiat
2018 INTERNATIONAL SYMPOSIUM ON CONSUMER TECHNOLOGIES (ISCT), 2018, : 18 - 21
[25] Malicious URI resolving in PDF documents
Hamon, Valentin
JOURNAL IN COMPUTER VIROLOGY AND HACKING TECHNIQUES, 2013, 9 (02) : 65 - 76
[26] On Lexical Resources for Digitization of Historical Documents
Gotscharek, Annette
Reffle, Ulrich
Ringlstetter, Christoph
Schulz, Klaus U.
DOCENG'09: PROCEEDINGS OF THE 2009 ACM SYMPOSIUM ON DOCUMENT ENGINEERING, 2009, : 193 - 200
[27] Towards Reverse Engineering of PDF Documents
Baker, Josef B.
Sexton, Alan P.
Sorge, Volker
DML 2011: TOWARDS A DIGITAL MATHEMATICS LIBRARY, 2011, : 65 - 75
[28] Recognition and classification of figures in PDF documents
Northeastern University, Boston, MA 02115, United States
1611, 231-242 (2006):
[29] Autentification of text documents using digital watermarking
Micic, A
Radenkovic, D
Nikolic, S
Telsiks 2005, Proceedings, Vols 1 and 2, 2005, : 503 - 505
[30] Classifying Text Documents using Unconventional Representation
Harish, B. S.
Kumar, S. V. Aruna
Manjunath, S.
2014 INTERNATIONAL CONFERENCE ON BIG DATA AND SMART COMPUTING (BIGCOMP), 2014, : 210 - +