A large-scale dataset for Chinese historical document recognition and analysis

Authors
Shi, Yongxin [1]
Peng, Dezhi [1, 2]
Zhang, Yuyi [1]
Cao, Jiahuan [1]
Jin, Lianwen [1, 3]
Affiliations
[1] South China Univ Technol, Sch Elect & Informat Engn, Guangzhou 510641, Peoples R China
[2] Huawei Cloud, Shenzhen 518129, Peoples R China
[3] SCUT, Zhuhai Inst Modern Ind Innovat, Zhuhai 519175, Peoples R China
Funding
National Natural Science Foundation of China
DOI
10.1038/s41597-025-04495-x
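Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09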
Abstract
The development of Chinese civilization has produced a vast collection of historical documents. Recognizing and analyzing these documents holds significant value for research on ancient culture. Recently, researchers have tried to use deep-learning techniques to automate recognition and analysis. However, existing Chinese historical document datasets, on which deep-learning models rely heavily, suffer from limited data scale, insufficient character categories, and a lack of book-level annotation. To fill this gap, we introduce HisDoc1B, a large-scale dataset for Chinese historical document recognition and analysis. HisDoc1B comprises 40,281 books, over 3 million document images, and over 1 billion characters across 30,615 character categories. To the best of our knowledge, HisDoc1B is the largest dataset in the field, more than 200 times the scale of existing datasets. Additionally, it is the only dataset with book-level annotations and punctuation annotations. Furthermore, extensive experiments demonstrate the high quality and practical utility of HisDoc1B. We believe that HisDoc1B can provide valuable resources to advance research in this domain.
Pages: 10