A large-scale dataset for Chinese historical document recognition and analysis

Authors
Shi, Yongxin [1]
Peng, Dezhi [1, 2]
Zhang, Yuyi [1]
Cao, Jiahuan [1]
Jin, Lianwen [1, 3]
Affiliations
[1] South China Univ Technol, Sch Elect & Informat Engn, Guangzhou 510641, Peoples R China
[2] Huawei Cloud, Shenzhen 518129, Peoples R China
[3] SCUT, Zhuhai Inst Modern Ind Innovat, Zhuhai 519175, Peoples R China
Funding
National Natural Science Foundation of China
DOI
10.1038/s41597-025-04495-x
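Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09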
Abstract
The development of Chinese civilization has produced a vast collection of historical documents. Recognizing and analyzing these documents holds significant value for research on ancient culture. Recently, researchers have tried to use deep-learning techniques to automate recognition and analysis. However, existing Chinese historical document datasets, on which deep-learning models rely heavily, suffer from limited data scale, insufficient character categories, and a lack of book-level annotations. To fill this gap, we introduce HisDoc1B, a large-scale dataset for Chinese historical document recognition and analysis. HisDoc1B comprises 40,281 books, over 3 million document images, and over 1 billion characters across 30,615 character categories. To the best of our knowledge, HisDoc1B is the largest dataset in the field, surpassing existing datasets in scale by more than 200 times. Additionally, it is the only dataset with book-level annotations and punctuation annotations. Furthermore, extensive experiments demonstrate the high quality and practical utility of HisDoc1B. We believe that HisDoc1B can provide a valuable resource to advance research in this domain.
Pages: 10