Table detection from plain text using machine learning and document structure

Authors
Li, JZ
Tang, J
Song, Q
Xu, P
Affiliations
Keywords
DOI
Not available
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
This paper addresses table extraction from plain text. Tables are one of the most common ways of presenting information, and table extraction has applications in information retrieval, knowledge acquisition, and text mining. Automatic information extraction from tables remains a challenge. Existing methods have focused mainly on table extraction from web pages (formatted table extraction); to the best of our knowledge, table extraction from plain text has not yet received sufficient attention. In this paper, unformatted table extraction is formalized as two subtasks: unformatted table block detection and unformatted table row identification. We concentrate on table extraction from Chinese documents. We propose to perform table extraction by combining machine learning methods with document structure. We first treat the task as a classification problem and propose a statistical approach based on Naive Bayes, defining features for the classification model. We then use document structure to improve detection performance. Experimental results indicate that the proposed methods significantly outperform the baseline methods for unformatted table extraction.
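The two subtasks named in the abstract, row identification and block detection, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not list the features or say how document structure is used, so the four layout features in `line_features`, the toy training lines in `TRAIN`, and the rule in `table_blocks` (keep only runs of at least `min_rows` consecutive predicted rows) are all hypothetical. The classifier is a plain Bernoulli Naive Bayes with Laplace smoothing.

```python
import math
import re


def line_features(line):
    """Hypothetical binary layout features for one line of plain text."""
    return {
        "multi_space_gap": bool(re.search(r"\S {2,}\S", line)),
        "has_tab": "\t" in line,
        "has_digit": any(c.isdigit() for c in line),
        "ends_with_period": line.rstrip().endswith((".", "。")),
    }


class BernoulliNaiveBayes:
    """Naive Bayes over binary features, with Laplace (add-one) smoothing."""

    def fit(self, feature_dicts, labels):
        self.classes = sorted(set(labels))
        self.names = sorted(feature_dicts[0])
        self.log_prior = {}
        self.p_true = {}
        for c in self.classes:
            rows = [f for f, y in zip(feature_dicts, labels) if y == c]
            self.log_prior[c] = math.log(len(rows) / len(labels))
            # P(feature is True | class), smoothed so it is never 0 or 1.
            self.p_true[c] = {
                k: (sum(r[k] for r in rows) + 1) / (len(rows) + 2)
                for k in self.names
            }
        return self

    def predict(self, feats):
        def log_score(c):
            total = self.log_prior[c]
            for k in self.names:
                p = self.p_true[c][k]
                total += math.log(p if feats[k] else 1.0 - p)
            return total

        return max(self.classes, key=log_score)


def table_blocks(row_flags, min_rows=2):
    """Group consecutive predicted table rows into (start, end) index pairs.

    Runs shorter than min_rows are dropped (an assumed structural rule).
    """
    blocks, start = [], None
    for i, flag in enumerate(list(row_flags) + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_rows:
                blocks.append((start, i - 1))
            start = None
    return blocks


# Toy labelled lines (invented for illustration): True marks a table row.
TRAIN = [
    ("Name    Age    City", True),
    ("Li      32     Beijing", True),
    ("Tang    28     Shanghai", True),
    ("This paper addresses table extraction from plain text.", False),
    ("Existing methods focus mainly on web pages.", False),
    ("We propose a statistical approach.", False),
]

model = BernoulliNaiveBayes().fit(
    [line_features(text) for text, _ in TRAIN],
    [label for _, label in TRAIN],
)


def detect_tables(lines):
    """Classify each line, then merge predicted rows into table blocks."""
    flags = [model.predict(line_features(line)) for line in lines]
    return flags, table_blocks(flags)
```

Calling `detect_tables(document.splitlines())` returns the per-line row predictions and the inclusive line ranges of the detected blocks. A realistic system would need far more training data and features suited to Chinese text, such as handling of full-width spaces.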
Pages: 818-823
Page count: 6