A Highly Accurate PDF-To-Text Conversion System for Academic Papers Using Natural Language Processing Approach

Cited by: 0
Authors
Yong, Tien Fui [1]
Azad, Saiful [2, 3]
Rahman, Mohammed Mostafizur [4]
Zamli, Kamal Z. [2, 3]
Rabby, Gollam [2]
Affiliations
[1] Univ Tunku Abdul Rahman, Fac Informat & Commun Technol, Kampar 31900, Perak, Malaysia
[2] Univ Malaysia Pahang, Fac Comp Syst & Software Engn, Gambang 26300, Pahang, Malaysia
[3] UMP, IBM Ctr Excellence, Gambang, Malaysia
[4] Amer Int Univ Bangladesh, Dhaka, Bangladesh
Keywords
PDF-To-Text Conversion; Natural Language Processing; Edit Distance
DOI
10.1166/asl.2018.13029
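Chinese Library Classification (CLC)
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Discipline classification codes
07; 0710; 09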
Abstract
Extracting text from PDF documents is never an easy task when high accuracy and consistency are the two main criteria to be attained. Although a considerable number of such systems exist, most of them fall short of the desired performance, especially where academic literature is concerned. Researchers who are heavily involved in text mining and project analysis need an accurate and consistent supporting tool for PDF-To-Text (PTT) conversion. Therefore, in this paper, we propose a Natural Language Processing based PDF-to-text (NLPDF) conversion system, which comprises two major steps, namely (i) reading the contents of the PDF and (ii) reconstructing the text. The performance of the proposed system is evaluated via four metrics, namely Precision, Recall, F-Measure (AF), and standard deviation, and compared with eight other similar benchmark systems available in the market. The experimental results demonstrate the effectiveness of the proposed system.
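The metrics named in the abstract, along with the edit distance listed in the keywords, can be illustrated with a minimal Python sketch. This record does not give the paper's exact definitions, so everything here is an assumption: token-level matching with multiplicity, the population standard deviation, the function names `prf` and `edit_distance`, and the made-up sample strings.

```python
from collections import Counter
from statistics import mean, pstdev


def prf(extracted: str, reference: str) -> tuple[float, float, float]:
    """Token-level precision, recall and F-measure (assumed definition)."""
    ext = Counter(extracted.split())
    ref = Counter(reference.split())
    overlap = sum((ext & ref).values())  # tokens in both, counted with multiplicity
    precision = overlap / sum(ext.values()) if ext else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a two-row dynamic programming table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


# Made-up (extracted, ground truth) pairs, for illustration only.
samples = [
    ("the quick brown fox", "the quick brown fox"),
    ("the qu ick brown fox", "the quick brown fox"),
]
f_scores = [prf(e, r)[2] for e, r in samples]
summary = (mean(f_scores), pstdev(f_scores))  # average F-measure and its spread
```

In this sketch a word broken during extraction ("qu ick") lowers both precision and recall, while the standard deviation of the per-document F-measure serves as a rough indicator of consistency across documents.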
Pages: 7844-7849
Page count: 6
Related papers
50 in total
  • [41] An Approach for Generating SQL Query Using Natural Language Processing
    More, Priyanka
    Kudale, Bharti
    Deshmukh, Pranali
    Biswas, Indira N.
    More, Neha J.
    Gomes, Francisco S.
    INTELLIGENT COMMUNICATION TECHNOLOGIES AND VIRTUAL MOBILE NETWORKS, ICICV 2019, 2020, 33 : 226 - 230
  • [42] Natural Language Interface to Database Using the DialogFlow Voice Recognition and Text Conversion API
    Villeda Maldonado, Julio Alejandro
    Gaona Cuadra, Jose Arturo
    2019 8TH INTERNATIONAL CONFERENCE ON SOFTWARE PROCESS IMPROVEMENT (CIMPS), 2019,
  • [43] From text to model: Leveraging natural language processing for system dynamics model development
    Veldhuis, Guido A.
    Blok, Dominique
    de Boer, Maaike H. T.
    Kalkman, Gino J.
    Bakker, Roos M.
    van Waas, Rob P. M.
    SYSTEM DYNAMICS REVIEW, 2024, 40 (03)
  • [44] Natural Language Query to SQL conversion using Machine Learning Approach
    Arefin, Minhazul
    Hossen, Kazi Mojammel
    Uddin, Mohammed Nasir
    2021 3RD INTERNATIONAL CONFERENCE ON SUSTAINABLE TECHNOLOGIES FOR INDUSTRY 4.0 (STI), 2021,
  • [45] Unveiling AI-Generated Financial Text: A Computational Approach Using Natural Language Processing and Generative Artificial Intelligence
    Arshed, Muhammad Asad
    Gherghina, Stefan Cristian
    Dewi, Christine
    Iqbal, Asma
    Mumtaz, Shahzad
    COMPUTATION, 2024, 12 (05)
  • [46] Indonesian Text Translator into Database Structured Query Language with Multi Parameters using Natural Language Processing
    Hermawan, G.
    Faturohman, I
    Isharmawan, N.
    2ND INTERNATIONAL CONFERENCE ON INFORMATICS, ENGINEERING, SCIENCE, AND TECHNOLOGY (INCITEST 2019), 2019, 662
  • [47] A SUBLANGUAGE APPROACH TO NATURAL-LANGUAGE PROCESSING FOR AN EXPERT-SYSTEM
    LIDDY, ED
    JORGENSEN, CL
    SIBERT, EE
    YU, ES
    INFORMATION PROCESSING & MANAGEMENT, 1993, 29 (05) : 633 - 645
  • [48] Generating Mind Map from Indonesian Text using Natural Language Processing Tools
    Saelan, Athia
    Purwarianti, Ayu
    4TH INTERNATIONAL CONFERENCE ON ELECTRICAL ENGINEERING AND INFORMATICS (ICEEI 2013), 2013, 11 : 1163 - 1169
  • [49] Novel Text Steganography Using Natural Language Processing and Part-of-Speech Tagging
    Banik, Barnali Gupta
    Bandyopadhyay, Samir Kumar
    IETE JOURNAL OF RESEARCH, 2020, 66 (03) : 384 - 395
  • [50] Web Document Text and Images Extraction using DOM Analysis and Natural Language Processing
    Joshi, Parag Mulendra
    Liu, Sam
    DOCENG'09: PROCEEDINGS OF THE 2009 ACM SYMPOSIUM ON DOCUMENT ENGINEERING, 2009, : 218 - 221