A large-scale dataset for Chinese historical document recognition and analysis

Authors
Shi, Yongxin [1]
Peng, Dezhi [1, 2]
Zhang, Yuyi [1]
Cao, Jiahuan [1]
Jin, Lianwen [1, 3]
Affiliations
[1] South China Univ Technol, Sch Elect & Informat Engn, Guangzhou 510641, Peoples R China
[2] Huawei Cloud, Shenzhen 518129, Peoples R China
[3] SCUT, Zhuhai Inst Modern Ind Innovat, Zhuhai 519175, Peoples R China
Funding
National Natural Science Foundation of China
DOI
10.1038/s41597-025-04495-x
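Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09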
Abstract
The development of Chinese civilization has produced a vast collection of historical documents. Recognizing and analyzing these documents holds significant value for research on ancient culture. Recently, researchers have tried to use deep-learning techniques to automate recognition and analysis. However, the existing Chinese historical document datasets on which deep-learning models heavily rely suffer from limited data scale, insufficient character categories, and a lack of book-level annotations. To fill this gap, we introduce HisDoc1B, a large-scale dataset for Chinese historical document recognition and analysis. HisDoc1B comprises 40,281 books, over 3 million document images, and over 1 billion characters across 30,615 character categories. To the best of our knowledge, HisDoc1B is the largest dataset in the field, more than 200 times the scale of existing datasets. Additionally, it is the only dataset with book-level annotations and punctuation annotations. Furthermore, extensive experiments demonstrate the high quality and practical utility of HisDoc1B. We believe that HisDoc1B can provide valuable resources to advance research in this domain.
Pages: 10