Digitization of Text Documents Using PDF/A

Cited by: 3
Authors
Han, Yan [1]
Wan, Xueheng [2]
Affiliations
[1] Univ Arizona Lib, Tucson, AZ 85721 USA
[2] Univ Arizona, Dept Comp Sci, Tucson, AZ 85721 USA
DOI
10.6017/ITAL.V37I1.9878
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
The purpose of this article is to demonstrate a practical use case of PDF/A for digitization of text documents following FADGI's recommendation of using PDF/A as a preferred digitization file format. The authors demonstrate how to convert and combine TIFFs with associated metadata into a single PDF/A-2b file for a document. Using real-life examples and open source software, the authors show readers how to convert TIFF images, extract associated metadata and International Color Consortium (ICC) profiles, and validate against the newly released PDF/A validator. The generated PDF/A file is a self-contained and self-described container that accommodates all the data from digitization of textual materials, including page-level metadata and ICC profiles. Providing theoretical analysis and empirical examples, the authors show that PDF/A has many advantages over the traditionally preferred file format, TIFF/JPEG2000, for digitization of text documents.
Pages: 52-64
Number of pages: 13