A large-scale dataset for Chinese historical document recognition and analysis

Cited by: 0
Authors:
Shi, Yongxin [1 ]
Peng, Dezhi [1 ,2 ]
Zhang, Yuyi [1 ]
Cao, Jiahuan [1 ]
Jin, Lianwen [1 ,3 ]
Affiliations:
[1] South China Univ Technol, Sch Elect & Informat Engn, Guangzhou 510641, Peoples R China
[2] Huawei Cloud, Shenzhen 518129, Peoples R China
[3] SCUT, Zhuhai Inst Modern Ind Innovat, Zhuhai 519175, Peoples R China
Funding:
National Natural Science Foundation of China
DOI: 10.1038/s41597-025-04495-x
Chinese Library Classification: O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Subject Classification Codes: 07; 0710; 09
Abstract
The development of Chinese civilization has produced a vast collection of historical documents. Recognizing and analyzing these documents holds significant value for research on ancient culture. Recently, researchers have sought to automate recognition and analysis with deep-learning techniques. However, existing Chinese historical document datasets, on which deep-learning models heavily rely, suffer from limited data scale, insufficient character categories, and a lack of book-level annotations. To fill this gap, we introduce HisDoc1B, a large-scale dataset for Chinese historical document recognition and analysis. HisDoc1B comprises 40,281 books, over 3 million document images, and over 1 billion characters across 30,615 character categories. To the best of our knowledge, HisDoc1B is the largest dataset in the field, surpassing existing datasets by more than 200 times in scale. Additionally, it is the only dataset with book-level and punctuation annotations. Furthermore, extensive experiments demonstrate the high quality and practical utility of HisDoc1B. We believe that HisDoc1B can provide valuable resources to advance research in this domain.
Pages: 10
Related Papers (50 items)
  • [21] THE SPEECHTRANSFORMER FOR LARGE-SCALE MANDARIN CHINESE SPEECH RECOGNITION
    Zhao, Yuanyuan
    Li, Jie
    Wang, Xiaorui
    Li, Yan
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 7095 - 7099
  • [22] DOCNLI: A Large-scale Dataset for Document-level Natural Language Inference
    Yin, Wenpeng
    Radev, Dragomir
    Xiong, Caiming
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, 2021, : 4913 - 4922
  • [23] A robust and efficient algorithm for Chinese historical document analysis and recognition
    Liu, Chongyu
    Jian, Cheng
    Huang, Jiarong
    Yang, Wentao
    Shi, Yongxin
    Jiang, Qing
    Jin, Lianwen
    NATIONAL SCIENCE REVIEW, 2023, 10 (06)
  • [25] Nostalgia on Twitter: Detection and Analysis of a Large-Scale Dataset
    Stanley Jothiraj, Fiona Victoria
    Hong, Lingzi
    Mashhadi, Afra
    Proceedings of the Association for Information Science and Technology, 2024, 61 (01) : 349 - 360
  • [26] Effective geometric restoration of distorted historical document for large-scale digitisation
    Yang, Po
    Antonacopoulos, Apostolos
    Clausner, Christian
    Pletschacher, Stefan
    Qi, Jun
    IET IMAGE PROCESSING, 2017, 11 (10) : 841 - 853
  • [27] UnityShip: A Large-Scale Synthetic Dataset for Ship Recognition in Aerial Images
    He, Boyong
    Li, Xianjiang
    Huang, Bo
    Gu, Enhui
    Guo, Weijie
    Wu, Liaoni
    REMOTE SENSING, 2021, 13 (24)
  • [28] A large-scale dataset for end-to-end table recognition in the wild
    Yang, Fan
    Hu, Lei
    Liu, Xinwu
    Huang, Shuangping
    Gu, Zhenghui
    SCIENTIFIC DATA, 2023, 10 (01)
  • [29] Vietnam-Celeb: a large-scale dataset for Vietnamese speaker recognition
    Pham Viet Thanh
    Nguyen Xuan Thai Hoa
    Hoang Long Vu
    Nguyen Thi Thu Trang
    INTERSPEECH 2023, 2023, : 1918 - 1922