A large-scale dataset for Chinese historical document recognition and analysis

Cited by: 0

Authors:

Shi, Yongxin [1]

Peng, Dezhi [1,2]

Zhang, Yuyi [1]

Cao, Jiahuan [1]

Jin, Lianwen [1,3]

Affiliations:

[1] South China Univ Technol, Sch Elect & Informat Engn, Guangzhou 510641, Peoples R China

[2] Huawei Cloud, Shenzhen 518129, Peoples R China

[3] SCUT, Zhuhai Inst Modern Ind Innovat, Zhuhai 519175, Peoples R China

Source:

SCIENTIFIC DATA | 2025 / Vol. 12 / Issue 01

Funding:

National Natural Science Foundation of China

DOI:

10.1038/s41597-025-04495-x

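Chinese Library Classification (CLC):

O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]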

Discipline classification codes:

07; 0710; 09

Abstract:

The development of Chinese civilization has produced a vast collection of historical documents. Recognizing and analyzing these documents holds significant value for research on ancient culture. Recently, researchers have tried to use deep-learning techniques to automate recognition and analysis. However, existing Chinese historical document datasets, on which deep-learning models rely heavily, suffer from limited data scale, insufficient character categories, and a lack of book-level annotations. To fill this gap, we introduce HisDoc1B, a large-scale dataset for Chinese historical document recognition and analysis. HisDoc1B comprises 40,281 books, over 3 million document images, and over 1 billion characters across 30,615 character categories. To the best of our knowledge, HisDoc1B is the largest dataset in the field, more than 200 times the scale of existing datasets. It is also the only dataset with book-level annotations and punctuation annotations. Furthermore, extensive experiments demonstrate the high quality and practical utility of HisDoc1B. We believe that HisDoc1B can serve as a valuable resource for advancing research in this domain.

Pages: 10
