A Highly Accurate PDF-To-Text Conversion System for Academic Papers Using Natural Language Processing Approach

Cited by: 0
Authors
Yong, Tien Fui [1]
Azad, Saiful [2, 3]
Rahman, Mohammed Mostafizur [4]
Zamli, Kamal Z. [2, 3]
Rabby, Gollam [2]
Affiliations
[1] Univ Tunku Abdul Rahman, Fac Informat & Commun Technol, Kampar 31900, Perak, Malaysia
[2] Univ Malaysia Pahang, Fac Comp Syst & Software Engn, Gambang 26300, Pahang, Malaysia
[3] UMP, IBM Ctr Excellence, Gambang, Malaysia
[4] Amer Int Univ Bangladesh, Dhaka, Bangladesh
Keywords
PDF-To-Text Conversion; Natural Language Processing; Edit Distance
DOI
10.1166/asl.2018.13029
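Chinese Library Classification (CLC)
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Subject classification codes
07; 0710; 09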
Abstract
Extracting text from PDF documents is never easy when high accuracy and consistency are the two main criteria. Although a considerable number of such systems exist, most fall short of the desired performance, especially on academic literature. Researchers who are heavily involved in text mining and project analysis need an accurate and consistent supporting tool for PDF-To-Text (PTT) conversion. Therefore, in this paper, we propose a Natural Language Processing based PDF-to-text (NLPDF) conversion system, which comprises two major steps, namely (i) reading the contents of the PDF and (ii) reconstructing the text. The performance of the proposed system is evaluated via four metrics, namely Precision, Recall, F-Measure (AF), and standard deviation, and compared with eight other similar benchmarked systems available in the market. The experimental results demonstrate the effectiveness of the proposed system.
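The abstract names the four evaluation metrics but does not define them. As a rough illustration only, the sketch below shows one common way such metrics could be computed for a PDF-to-text converter: token-level precision, recall, and F-measure against a reference transcription, plus the standard deviation of the F-measure across documents. The whitespace tokenization, the bag-of-words overlap, the function name `precision_recall_f`, and the sample strings are all assumptions made for this example and are not taken from the paper.

```python
from collections import Counter
from statistics import mean, pstdev


def precision_recall_f(extracted: str, reference: str) -> tuple[float, float, float]:
    """Token-level precision, recall, and F-measure (assumed definitions)."""
    # Assumption: tokens are whitespace-separated words, compared as multisets.
    ext = Counter(extracted.split())
    ref = Counter(reference.split())
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((ext & ref).values())
    precision = overlap / sum(ext.values()) if ext else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical (extracted, reference) pairs standing in for converted papers.
pairs = [
    ("the cat sat", "the cat sat down"),
    ("natural language processing", "natural language processing"),
]
f_scores = [precision_recall_f(e, r)[2] for e, r in pairs]
print("mean F-measure:", mean(f_scores))
print("std deviation:", pstdev(f_scores))
```

A low standard deviation across documents would correspond to the consistency criterion the abstract mentions, while precision and recall capture accuracy. The Edit Distance keyword suggests the paper may also use a character-level comparison, which this sketch does not attempt.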
Pages: 7844-7849
Page count: 6