Multi-Layout Invoice Document Dataset (MIDD): A Dataset for Named Entity Recognition

Cited by: 5
Authors
Baviskar, Dipali [1]
Ahirrao, Swati [1]
Kotecha, Ketan [2]
Affiliations
[1] Symbiosis Int, Symbiosis Inst Technol, Pune 412115, Maharashtra, India
[2] Symbiosis Int, Symbiosis Ctr Appl Artificial Intelligence, Pune 412115, Maharashtra, India
Keywords
Artificial Intelligence (AI); information extraction; Named Entity Recognition (NER); unstructured data
DOI
10.3390/data6070078
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The day-to-day working of an organization produces a massive volume of unstructured data in the form of invoices, legal contracts, mortgage processing forms, and many more. Organizations can utilize the insights concealed in such unstructured documents for their operational benefit. However, analyzing and extracting insights from such numerous and complex unstructured documents is a tedious task. Hence, the research in this area is encouraging the development of novel frameworks and tools that can automate the key information extraction from unstructured documents. However, the availability of standard, best-quality, and annotated unstructured document datasets is a serious challenge for accomplishing the goal of extracting key information from unstructured documents. This work expedites the researcher's task by providing a high-quality, highly diverse, multi-layout, and annotated invoice documents dataset for extracting key information from unstructured documents. Researchers can use the proposed dataset for layout-independent unstructured invoice document processing and to develop an artificial intelligence (AI)-based tool to identify and extract named entities in the invoice documents. Our dataset includes 630 invoice document PDFs with four different layouts collected from diverse suppliers. As far as we know, our invoice dataset is the only openly available dataset comprising high-quality, highly diverse, multi-layout, and annotated invoice documents.
Dataset: http://doi.org/10.5281/zenodo.5113009
Dataset License: CC-BY-4.0
Pages: 10
Related papers
50 in total
  • [31] A Multilingual Dataset for Named Entity Recognition, Entity Linking and Stance Detection in Historical Newspapers
    Hamdi, Ahmed
    Pontes, Elvys Linhares
    Boros, Emanuela
    Thi Tuyet Hai Nguyen
    Hackl, Guenter
    Moreno, Jose G.
    Doucet, Antoine
    SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021, : 2328 - 2334
  • [32] Towards Malay named entity recognition: an open-source dataset and a multi-task framework
    Fu, Yingwen
    Lin, Nankai
    Yang, Zhihe
    Jiang, Shengyi
    CONNECTION SCIENCE, 2023, 35 (01)
  • [33] FEW-NERD: A Few-shot Named Entity Recognition Dataset
    Ding, Ning
    Xu, Guangwei
    Chen, Yulin
    Wang, Xiaobin
    Han, Xu
    Xie, Pengjun
    Zheng, Hai-Tao
    Liu, Zhiyuan
    59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (ACL-IJCNLP 2021), VOL 1, 2021, : 3198 - 3213
  • [34] A comprehensive dataset and neural network approach for named entity recognition in the Uzbek language
    Mengliev, Davlatyor
    Barakhnin, Vladimir
    Eshkulov, Mukhriddin
    Ibragimov, Bahodir
    Madirimov, Shohrux
    DATA IN BRIEF, 2025, 58
  • [35] DNRTI: A Large-scale Dataset for Named Entity Recognition in Threat Intelligence
    Wang, Xuren
    Liu, Xinpei
    Ao, Shengqin
    Li, Ning
    Jiang, Zhengwei
    Xu, Zongyi
    Xiong, Zihan
    Xiong, Mengbo
    Zhang, Xiaoqing
    2020 IEEE 19TH INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS (TRUSTCOM 2020), 2020, : 1842 - 1848
  • [36] LeNER-Br: A Dataset for Named Entity Recognition in Brazilian Legal Text
    Luz de Araujo, Pedro Henrique
    de Campos, Teofilo E.
    de Oliveira, Renato R. R.
    Stauffer, Matheus
    Couto, Samuel
    Bermejo, Paulo
    COMPUTATIONAL PROCESSING OF THE PORTUGUESE LANGUAGE, PROPOR 2018, 2018, 11122 : 313 - 323
  • [37] EPIC: An epidemiological investigation of COVID-19 dataset for Chinese named entity recognition
    Li, Pu
    Zhou, Guohao
    Guo, Yanbu
    Zhang, Suzhi
    Jiang, Yuncheng
    Tang, Yong
    INFORMATION PROCESSING & MANAGEMENT, 2024, 61 (01)
  • [38] TF-NERD: Tagalog Fine-grained Named Entity Recognition Dataset
    Ramos, Robin Kamille B.
    Vergara, John Paul C.
    PROCEEDINGS OF 2023 7TH INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND INFORMATION RETRIEVAL, NLPIR 2023, 2023, : 222 - 227
  • [39] NERvous About My Health: Constructing a Bengali Medical Named Entity Recognition Dataset
    Khan, Alvi Aveen
    Kamal, Fida
    Nower, Nuzhat
    Ahmed, Tasnim
    Ahmed, Sabbir
    Chowdhury, Tareque Mohmud
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 5768 - 5774
  • [40] Named Entity Recognition with Conditional Random Fields on Turkish News Dataset: Revisiting the Features
    Cekinel, Recep Firat
    Agriman, Mustafa
    Karagoz, Pinar
    Yilmaz, Burcu
    2019 27TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE (SIU), 2019