A large-scale dataset for Chinese historical document recognition and analysis

Authors
Shi, Yongxin [1]
Peng, Dezhi [1, 2]
Zhang, Yuyi [1]
Cao, Jiahuan [1]
Jin, Lianwen [1, 3]
Affiliations
[1] South China Univ Technol, Sch Elect & Informat Engn, Guangzhou 510641, Peoples R China
[2] Huawei Cloud, Shenzhen 518129, Peoples R China
[3] SCUT, Zhuhai Inst Modern Ind Innovat, Zhuhai 519175, Peoples R China
Funding
National Natural Science Foundation of China
DOI
10.1038/s41597-025-04495-x
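Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification code
07; 0710; 09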
Abstract
The development of Chinese civilization has produced a vast collection of historical documents. Recognizing and analyzing these documents is of significant value for research on ancient culture. Recently, researchers have tried to use deep-learning techniques to automate this recognition and analysis. However, the existing Chinese historical document datasets on which deep-learning models heavily rely suffer from limited data scale, too few character categories, and a lack of book-level annotation. To fill this gap, we introduce HisDoc1B, a large-scale dataset for Chinese historical document recognition and analysis. HisDoc1B comprises 40,281 books, over 3 million document images, and over 1 billion characters across 30,615 character categories. To the best of our knowledge, HisDoc1B is the largest dataset in the field, more than 200 times the scale of existing datasets. It is also the only dataset with both book-level annotations and punctuation annotations. Extensive experiments demonstrate the high quality and practical utility of HisDoc1B. We believe that HisDoc1B can provide valuable resources to advance research in this domain.
Pages: 10