Data-Driven Recognition and Extraction of PDF Document Elements

Cited by: 4
Authors
Hansen, Matthias [1]
Pomp, Andre [2]
Erki, Kemal [1]
Meisen, Tobias [2]
Affiliations
[1] Rhein Westfal TH Aachen, Inst Informat Management Mech Engn, D-52068 Aachen, Germany
[2] Univ Wuppertal, Chair Technol & Management Digital Transformat, D-42119 Wuppertal, Germany
Keywords
PDF extraction; machine learning; data corpus; data processing; unstructured data; FIGURES;
DOI
10.3390/technologies7030065
Chinese Library Classification
T [Industrial Technology]
Discipline classification code
08
Abstract
In the age of digitalization, the collection and analysis of large amounts of data is becoming increasingly important for enterprises to improve their businesses and processes, such as the introduction of new services or the realization of resource-efficient production. Enterprises concentrate strongly on the integration, analysis and processing of their data. Unfortunately, the majority of data analysis focuses on structured and semi-structured data, although unstructured data such as text documents or images account for the largest share of all available enterprise data. One reason for this is that most of this data is not machine-readable and requires dedicated analysis methods, such as natural language processing for analyzing textual documents or object recognition for recognizing objects in images. Especially in the latter case, the analysis methods depend strongly on the application. However, there are also data formats, such as PDF documents, which are not machine-readable and consist of many different document elements such as tables, figures or text sections. Although the analysis of PDF documents is a major challenge, they are used in all enterprises and contain various information that may contribute to analysis use cases. In order to enable their efficient retrievability and analysis, it is necessary to identify the different types of document elements so that we are able to process them with tailor-made approaches. In this paper, we propose a system that forms the basis for structuring unstructured PDF documents, so that the identified document elements can subsequently be retrieved and analyzed with tailor-made approaches. Due to the high diversity of possible document elements and analysis methods, this paper focuses on the automatic identification and extraction of data visualizations, algorithms, other diagram-like objects and tables from a mixed document body. For that, we present two different approaches. 
The first approach combines deep learning with rule-based image processing, whereas the second approach is based purely on deep learning. To train our neural networks, we manually annotated a large corpus of PDF documents with our own annotation tool; both the corpus and the tool are published together with this paper. The results of our extraction pipeline show that we are able to automatically extract graphical items with a precision of 0.73 and a recall of 0.8. For tables, we reach a precision of 0.78 and a recall of 0.94.
Pages: 19