Table detection from plain text using machine learning and document structure

Authors
Li, JZ
Tang, J
Song, Q
Xu, P
DOI
Not available
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
This paper addresses the problem of table extraction from plain text. Tables are one of the most common ways of presenting information, and table extraction has applications in information retrieval, knowledge acquisition, and text mining. Automatic information extraction from tables remains a challenge. Existing methods have mainly focused on table extraction from web pages (formatted table extraction). To the best of our knowledge, table extraction from plain text has not received sufficient attention. In this paper, unformatted table extraction is formalized as two subtasks: unformatted table block detection and unformatted table row identification. We concentrate particularly on table extraction from Chinese documents. We propose to perform table extraction by combining machine learning methods with document structure. We first view the task as classification and propose a statistical approach based on Naive Bayes, defining features for the classification model. Next, we use document structure to improve detection performance. Experimental results indicate that the proposed methods significantly outperform the baseline methods for unformatted table extraction.
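To make the classification view concrete, the following is a minimal sketch of table row identification with a Naive Bayes classifier over binary line features, written in plain Python. It is not the authors' implementation: the abstract does not list the features, so `line_features`, the toy training lines in `TRAIN`, and the helper names `classify_line` and `table_blocks` are all assumptions made for illustration. Grouping consecutive rows into blocks is likewise a simplification and does not reproduce the paper's block detection or its use of document structure.

```python
import math
import re
from collections import defaultdict


def line_features(line):
    """Hypothetical binary features for one plain-text line (not from the paper)."""
    tokens = line.split()
    digits = sum(ch.isdigit() for ch in line)
    return {
        # Two or more spaces, or a tab, between non-space characters.
        "has_wide_gap": bool(re.search(r"\S(?: {2,}|\t)\S", line)),
        "mostly_digits": digits > 0.3 * max(len(line.replace(" ", "")), 1),
        "few_tokens": 2 <= len(tokens) <= 8,
        "ends_with_period": line.rstrip().endswith((".", "。")),
    }


class BernoulliNaiveBayes:
    """Naive Bayes over binary features with add-one (Laplace) smoothing."""

    def fit(self, feature_dicts, labels):
        self.classes = sorted(set(labels))
        self.features = sorted(feature_dicts[0])
        self.class_count = defaultdict(int)
        self.true_count = defaultdict(int)  # (label, feature) -> times True
        for feats, label in zip(feature_dicts, labels):
            self.class_count[label] += 1
            for name in self.features:
                if feats[name]:
                    self.true_count[(label, name)] += 1
        self.total = len(labels)
        return self

    def log_posterior(self, feats, label):
        n = self.class_count[label]
        score = math.log(n / self.total)  # log prior
        for name in self.features:
            p_true = (self.true_count[(label, name)] + 1) / (n + 2)
            score += math.log(p_true if feats[name] else 1.0 - p_true)
        return score

    def predict(self, feats):
        return max(self.classes, key=lambda c: self.log_posterior(feats, c))


# Toy training data, invented for this sketch.
TRAIN = [
    ("Name      Age   City", "table"),
    ("Li        34    Beijing", "table"),
    ("Tang      29    Shanghai", "table"),
    ("2003\t1200\t35.5", "table"),
    ("Table extraction has many applications.", "text"),
    ("We evaluate the method on Chinese documents.", "text"),
    ("The results are discussed in the next section.", "text"),
    ("Existing work focuses on web pages.", "text"),
]

model = BernoulliNaiveBayes().fit(
    [line_features(line) for line, _ in TRAIN],
    [label for _, label in TRAIN],
)


def classify_line(line):
    """Label a single line as 'table' or 'text'."""
    return model.predict(line_features(line))


def table_blocks(lines):
    """Group consecutive table rows into (start, end) index pairs."""
    blocks, start = [], None
    for i, line in enumerate(lines):
        is_row = classify_line(line) == "table"
        if is_row and start is None:
            start = i
        elif not is_row and start is not None:
            blocks.append((start, i - 1))
            start = None
    if start is not None:
        blocks.append((start, len(lines) - 1))
    return blocks
```

Calling `table_blocks` on the lines of a document returns the index ranges of candidate table blocks. A realistic system would need a much larger labeled corpus and features suited to Chinese text.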
Pages: 818-823
Page count: 6