Digitization of Text Documents Using PDF/A

Cited by: 3
Authors
Han, Yan [1]
Wan, Xueheng [2]
Affiliations
[1] Univ Arizona Lib, Tucson, AZ 85721 USA
[2] Univ Arizona, Dept Comp Sci, Tucson, AZ 85721 USA
DOI
10.6017/ITAL.V37I1.9878
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The purpose of this article is to demonstrate a practical use case of PDF/A for digitization of text documents, following FADGI's recommendation of using PDF/A as a preferred digitization file format. The authors demonstrate how to convert and combine TIFFs with associated metadata into a single PDF/A-2b file for a document. Using real-life examples and open source software, the authors show readers how to convert TIFF images, extract associated metadata and International Color Consortium (ICC) profiles, and validate the output against the newly released PDF/A validator. The generated PDF/A file is a self-contained and self-described container that accommodates all the data from digitization of textual materials, including page-level metadata and ICC profiles. Providing theoretical analysis and empirical examples, the authors show that PDF/A has many advantages over the traditionally preferred file format, TIFF/JPEG2000, for digitization of text documents.
Pages: 52-64
Page count: 13
Related papers
50 in total
  • [21] A Wrapper Generation System for PDF Documents
    Fazzinga, Bettina
    Flesca, Sergio
    Tagarelli, Andrea
    Garruzzo, Salvatore
    Masciari, Elio
    APPLIED COMPUTING 2008, VOLS 1-3, 2008, : 442 - +
  • [22] Recognition and classification of figures in PDF documents
    Shao, Mingyan
    Futrelle, Robert P.
    GRAPHICS RECOGNITION: TEN YEARS REVIEW AND FUTURE PERSPECTIVES, 2006, 3926 : 231 - 242
  • [23] Layout and content extraction for PDF documents
    Chao, H
    Fan, J
    DOCUMENT ANALYSIS SYSTEMS VI, PROCEEDINGS, 2004, 3163 : 213 - 224
  • [24] ONLINE BINARY VISUALIZATION FOR PDF DOCUMENTS
    Mavric, Soon Heng Tan
    Yeo, Chai Kiat
    2018 INTERNATIONAL SYMPOSIUM ON CONSUMER TECHNOLOGIES (ISCT), 2018, : 18 - 21
  • [25] Malicious URI resolving in PDF documents
    Hamon, Valentin
    JOURNAL IN COMPUTER VIROLOGY AND HACKING TECHNIQUES, 2013, 9 (02): : 65 - 76
  • [26] On Lexical Resources for Digitization of Historical Documents
    Gotscharek, Annette
    Reffle, Ulrich
    Ringlstetter, Christoph
    Schulz, Klaus U.
    DOCENG'09: PROCEEDINGS OF THE 2009 ACM SYMPOSIUM ON DOCUMENT ENGINEERING, 2009, : 193 - 200
  • [27] Towards Reverse Engineering of PDF Documents
    Baker, Josef B.
    Sexton, Alan P.
    Sorge, Volker
    DML 2011: TOWARDS A DIGITAL MATHEMATICS LIBRARY, 2011, : 65 - 75
  • [28] Recognition and classification of figures in PDF documents
    Northeastern University, Boston, MA 02115, United States
    1611, 231-242 (2006):
  • [29] Autentification of text documents using digital watermarking
    Micic, A
    Radenkovic, D
    Nikolic, S
    Telsiks 2005, Proceedings, Vols 1 and 2, 2005, : 503 - 505
  • [30] Classifying Text Documents using Unconventional Representation
    Harish, B. S.
    Kumar, S. V. Aruna
    Manjunath, S.
    2014 INTERNATIONAL CONFERENCE ON BIG DATA AND SMART COMPUTING (BIGCOMP), 2014, : 210 - +