Automatic Metadata Extraction Incorporating Visual Features from Scanned Electronic Theses and Dissertations

Cited by: 5
Authors
Choudhury, Muntabir Hasan [1]
Jayanetti, Himarsha R. [1]
Wu, Jian [1]
Ingram, William A. [2]
Fox, Edward A. [2]
Affiliations
[1] Old Dominion Univ, Dept Comp Sci, Norfolk, VA 23529 USA
[2] Virginia Polytech Inst & State Univ, Dept Comp Sci, Blacksburg, VA 24061 USA
Source
2021 ACM/IEEE JOINT CONFERENCE ON DIGITAL LIBRARIES (JCDL 2021) | 2021
Keywords
Digital Libraries; Optical Character Recognition; Text Mining; Metadata Extraction; CRF; BiLSTM
DOI
10.1109/JCDL52503.2021.00066
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Subject classification code
0812
Abstract
Electronic Theses and Dissertations (ETDs) contain domain knowledge that can be used for many digital library tasks, such as analyzing citation networks and predicting research trends. Automatic metadata extraction is important for building scalable digital library search engines. Most existing methods, such as GROBID, CERMINE, and ParsCit, are designed for born-digital documents, so they often fail to extract metadata from scanned documents such as ETDs. Traditional sequence tagging methods mainly rely on text-based features. In this paper, we propose a conditional random field (CRF) model that combines text-based and visual features. To verify the robustness of our model, we extended an existing corpus and created a new ground truth corpus consisting of 500 ETD cover pages with human-validated metadata. Our experiments show that the CRF with visual features outperformed both a heuristic baseline and a CRF model with only text-based features. The proposed model achieved F1 scores of 81.3%-96% across seven metadata fields. The data and source code are publicly available on Google Drive and in a GitHub repository, respectively.
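To illustrate what combining text-based and visual features for a sequence tagger might look like, here is a minimal Python sketch. It is not the authors' implementation. The `Line` type, its bounding-box fields, and every feature name are hypothetical and chosen only for illustration; the abstract does not list the actual features used. The output is a list of per-line feature dictionaries, a shape that CRF toolkits commonly accept as input.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One OCR'd line of a cover page (hypothetical structure)."""
    text: str
    x: float       # left edge of the bounding box, in pixels
    y: float       # top edge of the bounding box, in pixels
    width: float   # bounding-box width, in pixels
    height: float  # bounding-box height, a rough proxy for font size


def text_features(line: Line) -> dict:
    """Features computed from the OCR text alone (illustrative choices)."""
    tokens = line.text.split()
    capitalized = sum(t[0].isupper() for t in tokens)
    return {
        "text.n_tokens": len(tokens),
        "text.all_upper": line.text.isupper(),
        "text.has_digit": any(ch.isdigit() for ch in line.text),
        "text.cap_ratio": capitalized / len(tokens) if tokens else 0.0,
    }


def visual_features(line: Line, page_width: float, page_height: float) -> dict:
    """Features computed from layout alone (illustrative choices)."""
    center_x = line.x + line.width / 2
    return {
        "vis.rel_top": line.y / page_height,
        "vis.rel_height": line.height / page_height,
        "vis.center_offset": abs(center_x - page_width / 2) / page_width,
    }


def page_to_features(lines, page_width, page_height):
    """Merge text and visual features into one dict per line."""
    merged = []
    for line in lines:
        feats = text_features(line)
        feats.update(visual_features(line, page_width, page_height))
        merged.append(feats)
    return merged


# Made-up cover page used only to exercise the functions above.
page = [
    Line("AUTOMATIC METADATA EXTRACTION", 200, 100, 600, 40),
    Line("by Jane Doe", 400, 300, 200, 20),
    Line("May 1995", 420, 800, 160, 20),
]
features = page_to_features(page, page_width=1000, page_height=1000)
```

Keeping the two feature groups in separate functions makes it straightforward to train one model on text features only and another on the merged set, which mirrors the comparison the abstract reports.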
Pages: 230-233
Page count: 4
Related papers
50 in total
  • [21] Automatic thesaurus development: Term extraction from title metadata
    Wang, J
    JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 2006, 57 (07): 907 - 920
  • [22] CERMINE - automatic extraction of metadata and references from scientific literature
    Tkaczyk, Dominika
    Szostek, Pawel
    Dendek, Piotr Jan
    Fedoryszak, Mateusz
    Bolikowski, Lukasz
    2014 11TH IAPR INTERNATIONAL WORKSHOP ON DOCUMENT ANALYSIS SYSTEMS (DAS 2014), 2014: 217 - 221
  • [23] Automatic thesaurus development: Term extraction from title metadata
    Jun Wang
    数字图书馆论坛, 2006, (05) : 78 - 78
  • [24] CERMINE: automatic extraction of structured metadata from scientific literature
    Dominika Tkaczyk
    Paweł Szostek
    Mateusz Fedoryszak
    Piotr Jan Dendek
    Łukasz Bolikowski
    International Journal on Document Analysis and Recognition (IJDAR), 2015, 18 : 317 - 335
  • [25] CERMINE: automatic extraction of structured metadata from scientific literature
    Tkaczyk, Dominika
    Szostek, Pawel
    Fedoryszak, Mateusz
    Dendek, Piotr Jan
    Bolikowski, Lukasz
    INTERNATIONAL JOURNAL ON DOCUMENT ANALYSIS AND RECOGNITION, 2015, 18 (04) : 317 - 335
  • [26] Automatic Feature Extraction and Text Recognition From Scanned Topographic Maps
    Pezeshk, Aria
    Tutwiler, Richard L.
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2011, 49 (12): 5047 - 5063
  • [27] Automatic extraction of coordinate grid lines from scanned mining maps
    Sheng, Yehua
    Guo, Dazhi
    Du, Peijun
    Tang, Hong
    Zhongguo Kuangye Daxue Xuebao/Journal of China University of Mining and Technology, 2000, 29 (01): 60 - 62
  • [28] Extraction of High Level Visual Features for the Automatic Recognition of UTIs
    Andreini, Paolo
    Bonechi, Simone
    Bianchini, Monica
    Baghini, Andrea
    Bianchi, Giovanni
    Guerri, Francesco
    Galano, Angelo
    Mecocci, Alessandro
    Vaggelli, Guendalina
    FUZZY LOGIC AND SOFT COMPUTING APPLICATIONS, WILF 2016, 2017, 10147: 249 - 259
  • [29] Automatic extraction of web search interface based on visual features
    Zhang, Yu-lian
    Qiao, Jing-yang
    ICIEA 2008: 3RD IEEE CONFERENCE ON INDUSTRIAL ELECTRONICS AND APPLICATIONS, PROCEEDINGS, VOLS 1-3, 2008: 2288 - 2291
  • [30] COMPARISONS OF VISUAL FEATURES EXTRACTION TOWARDS AUTOMATIC LIP READING
    Butt, Waqqas Ur Rehman
    Lombardi, Luca
    EDULEARN13: 5TH INTERNATIONAL CONFERENCE ON EDUCATION AND NEW LEARNING TECHNOLOGIES, 2013: 2188 - 2196