Data Augmentation and Text Recognition on Khmer Historical Manuscripts

Cited by: 6
Authors
Valy, Dona [1]
Verleysen, Michel [2]
Chhun, Sophea [1]
Affiliations
[1] Inst Technol Cambodia, Dept Informat & Commun Engn, Phnom Penh, Cambodia
[2] Catholic Univ Louvain, ICTEAM Inst, Ottignies, Belgium
Keywords
historical document analysis; palm leaf manuscript; neural network; data augmentation; CHARACTER;
DOI
10.1109/ICFHR2020.2020.00024
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Analysis and recognition of historical documents face many challenges, one of which is the scarcity of the ground truth data needed for most machine learning techniques, deep learning in particular. In this paper, we present a novel approach that significantly augments the word image samples generated from an existing dataset of ancient Khmer palm leaf manuscripts. Instead of segmenting real Khmer words, we combine the annotated glyphs into groups called sub-syllables. We also propose a new text recognition method that takes into account the spatially complex structure of Khmer writing. The proposed method is composed of two main modules: a feature generator and a decoder. The generator uses convolutional blocks, inception blocks, and a bi-directional LSTM to encode information extracted from the input image, which the attention-based decoder then decodes to predict the final text transcription. Experiments are conducted on a new dataset of groups of sub-syllables constructed from annotated glyphs of the SleukRith Set.
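To illustrate what one step of an attention-based decoder does with the encoded features, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not say which scoring function the decoder uses, so dot-product scoring is assumed here, and the names attention_step, encoder_states, and decoder_state are hypothetical.

```python
import math

def attention_step(encoder_states, decoder_state):
    """One decoding step of attention (dot-product scoring is an assumption)."""
    # Score each encoder time step against the current decoder state.
    scores = [sum(h * s for h, s in zip(state, decoder_state))
              for state in encoder_states]
    # Softmax over time steps, subtracting the max for numerical stability.
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * state[d] for w, state in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context

# Toy input: three encoder time steps with two features each.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 2.0]
weights, context = attention_step(encoder_states, decoder_state)
```

In a full recognizer the context vector would be combined with the decoder state to predict the next output symbol, and the step would repeat until an end-of-sequence symbol is produced.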
Pages: 73-78
Page count: 6