Data Augmentation and Text Recognition on Khmer Historical Manuscripts

Cited by: 6
Authors
Valy, Dona [1]
Verleysen, Michel [2]
Chhun, Sophea [1]
Affiliations
[1] Inst Technol Cambodia, Dept Informat & Commun Engn, Phnom Penh, Cambodia
[2] Catholic Univ Louvain, ICTEAM Inst, Ottignies, Belgium
Keywords
historical document analysis; palm leaf manuscript; neural network; data augmentation; character
DOI
10.1109/ICFHR2020.2020.00024
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Analysis and recognition of historical documents face many challenges, one of which is the scarcity of the ground truth data needed for most machine learning techniques, deep learning in particular. In this paper, we present a novel approach that significantly augments the word image samples generated from an existing dataset of Khmer ancient palm leaf manuscripts. Instead of segmenting real Khmer words, we combine the annotated glyphs into groups called sub-syllables. A new text recognition method is also proposed to take into account the spatially complex structure of Khmer writing. The proposed method is composed of two main modules: a feature generator and a decoder. The generator uses convolutional blocks, inception blocks, and a bidirectional LSTM to encode information extracted from the input image, which the attention-based decoder then decodes to predict the final text transcription. Experiments are conducted on a new dataset of groups of sub-syllables constructed from annotated glyphs of the SleukRith Set.
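The augmentation idea in the abstract (building new training samples by combining annotated glyphs rather than segmenting real words) can be illustrated with a toy sketch. This is not the authors' pipeline: the function name `compose_group`, the `glyph_bank` structure, the bitmap representation as nested lists, and the plain left-to-right, bottom-aligned layout are all assumptions made for illustration. Real Khmer sub-syllables stack glyphs in two dimensions, which this sketch does not attempt to model.

```python
import random


def compose_group(glyph_bank, labels, rng, gap=1):
    """Build one synthetic sample from annotated glyph bitmaps.

    glyph_bank: dict mapping a glyph label to a list of bitmaps
                (each bitmap is a list of rows of 0/1 pixels).
    labels:     sequence of glyph labels to combine (hypothetical names).
    rng:        a random.Random instance, so sampling is reproducible.
    gap:        blank columns inserted between neighbouring glyphs.

    Assumption: glyphs are pasted left to right and bottom-aligned.
    """
    # Pick one handwritten sample at random for every requested glyph.
    samples = [rng.choice(glyph_bank[label]) for label in labels]
    height = max(len(sample) for sample in samples)
    rows = [[] for _ in range(height)]

    for index, sample in enumerate(samples):
        width = len(sample[0])
        pad_top = height - len(sample)  # blank rows above shorter glyphs
        for r in range(height):
            src = r - pad_top
            rows[r].extend(sample[src] if src >= 0 else [0] * width)
            if index < len(samples) - 1:
                rows[r].extend([0] * gap)

    # The ground truth of the synthetic image is the label sequence itself.
    return rows, list(labels)


if __name__ == "__main__":
    # Two tiny placeholder glyphs; real data would come from the annotations.
    bank = {"a": [[[1, 1], [1, 1]]], "b": [[[1]]]}
    image, truth = compose_group(bank, ["a", "b"], random.Random(0))
    for row in image:
        print("".join("#" if px else "." for px in row))
    print(truth)
```

Because each glyph label can have many handwritten samples, drawing a different sample per position yields many distinct images for the same transcription, which is the sense in which such a scheme multiplies the available ground truth.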
Pages: 73-78
Page count: 6