Automatic Metadata Extraction Incorporating Visual Features from Scanned Electronic Theses and Dissertations

Cited by: 5
Authors
Choudhury, Muntabir Hasan [1]
Jayanetti, Himarsha R. [1]
Wu, Jian [1]
Ingram, William A. [2]
Fox, Edward A. [2]
Affiliations
[1] Old Dominion Univ, Dept Comp Sci, Norfolk, VA 23529 USA
[2] Virginia Polytech Inst & State Univ, Dept Comp Sci, Blacksburg, VA 24061 USA
Keywords
Digital Libraries; Optical Character Recognition; Text Mining; Metadata Extraction; CRF; BiLSTM;
DOI
10.1109/JCDL52503.2021.00066
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Electronic Theses and Dissertations (ETDs) contain domain knowledge that can be used for many digital library tasks, such as analyzing citation networks and predicting research trends. Automatic metadata extraction is important for building scalable digital library search engines. Most existing methods, such as GROBID, CERMINE, and ParsCit, are designed for born-digital documents, so they often fail to extract metadata from scanned documents such as ETDs. Traditional sequence tagging methods mainly rely on text-based features. In this paper, we propose a conditional random field (CRF) model that combines text-based and visual features. To verify the robustness of our model, we extended an existing corpus and created a new ground truth corpus consisting of 500 ETD cover pages with human-validated metadata. Our experiments show that the CRF with visual features outperformed both a heuristic baseline and a CRF model with only text-based features. The proposed model achieved an F1 measure of 81.3%-96% on seven metadata fields. The data and source code are publicly available on Google Drive and in a GitHub repository, respectively.
Pages: 230-233
Page count: 4
Related papers
50 in total
  • [1] Automatic Metadata Extraction From Iranian Theses And Dissertations
    Rahnama, Mohadese
    Hasheminejad, Seyed Mohammad Hossein
    Nasiri, Jalal A.
    2020 6TH IRANIAN CONFERENCE ON SIGNAL PROCESSING AND INTELLIGENT SYSTEMS (ICSPIS), 2020
  • [2] ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations
    Kahu, Sampanna Yashwant
    Ingram, William A.
    Fox, Edward A.
    Wu, Jian
    2021 ACM/IEEE JOINT CONFERENCE ON DIGITAL LIBRARIES (JCDL 2021), 2021: 180-191
  • [3] Morphing metadata: maximizing access to electronic theses and dissertations
    McCutcheon, Sevim
    Kreyche, Michael
    Maurer, Margaret Beecher
    Nickerson, Joshua
    LIBRARY HI TECH, 2008, 26 (01): 41-57
  • [4] Automatic classification of digital objects for improved metadata quality of electronic theses and dissertations in institutional repositories
    Phiri, Lighton
    International Journal of Metadata, Semantics and Ontologies, 2020, 14 (03): 234-248
  • [5] MetaEnhance: Metadata Quality Improvement for Electronic Theses and Dissertations of University Libraries
    Choudhury, Muntabir Hasan
    Salsabil, Lamia
    Jayanetti, Himarsha R.
    Wu, Jian
    Ingram, William A.
    Fox, Edward A.
    2023 ACM/IEEE JOINT CONFERENCE ON DIGITAL LIBRARIES, JCDL, 2023: 61-65
  • [6] An Analysis of Evolving Metadata Influences, Standards, and Practices in Electronic Theses and Dissertations
    Potvin, Sarah
    Thompson, Santi
    LIBRARY RESOURCES & TECHNICAL SERVICES, 2016, 60 (02): 99-114
  • [7] Building datasets to support information extraction and structure parsing from electronic theses and dissertations
    Ingram, William A.
    Wu, Jian
    Kahu, Sampanna Yashwant
    Manzoor, Javaid Akbar
    Banerjee, Bipasha
    Ahuja, Aman
    Choudhury, Muntabir Hasan
    Salsabil, Lamia
    Shields, Winston
    Fox, Edward A.
    INTERNATIONAL JOURNAL ON DIGITAL LIBRARIES, 2024, 25 (02): 175-196
  • [8] Metadata matters: evaluating the quality of Electronic Theses and Dissertations (ETDs) descriptions in Malaysian institutional repositories
    Osman, R.
    Idaya, A. M. K. Yanti
    Abrizah, A.
    MALAYSIAN JOURNAL OF LIBRARY & INFORMATION SCIENCE, 2023, 28 (01): 109-125
  • [9] Metadata versus Full-Text: Tracking Users' Electronic Theses and Dissertations (ETDs) Seeking Behavior
    Alemneh, Daniel Gelaw
    Phillips, Mark
    TRANSFORMING DIGITAL WORLDS, ICONFERENCE 2018, 2018, 10766: 317-322
  • [10] Electronic theses and dissertations and academia: A preliminary study from India
    Vijayakumar, J. K.
    Murthy, T. A. V.
    Khan, M. T. M.
    JOURNAL OF ACADEMIC LIBRARIANSHIP, 2007, 33 (03): 417-421