Multi-Layout Invoice Document Dataset (MIDD): A Dataset for Named Entity Recognition

Cited by: 5

Authors:

Baviskar, Dipali [1]

Ahirrao, Swati [1]

Kotecha, Ketan [2]

Affiliations:

[1] Symbiosis Int, Symbiosis Inst Technol, Pune 412115, Maharashtra, India

[2] Symbiosis Int, Symbiosis Ctr Appl Artificial Intelligence, Pune 412115, Maharashtra, India

Source:

DATA | 2021 / Volume 6 / Issue 7

Keywords:

Artificial Intelligence (AI); information extraction; Named Entity Recognition (NER); unstructured data

DOI:

10.3390/data6070078

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

The day-to-day operation of an organization produces a massive volume of unstructured data in the form of invoices, legal contracts, mortgage processing forms, and many more. Organizations can use the insights concealed in such unstructured documents for their operational benefit. However, analyzing and extracting insights from such numerous and complex documents is a tedious task. Research in this area therefore encourages the development of novel frameworks and tools that automate key information extraction from unstructured documents. The limited availability of standard, high-quality, annotated unstructured document datasets is a serious obstacle to this goal. This work eases the researcher's task by providing a high-quality, highly diverse, multi-layout, annotated invoice document dataset for extracting key information from unstructured documents. Researchers can use the proposed dataset for layout-independent unstructured invoice document processing and to develop artificial intelligence (AI)-based tools that identify and extract named entities in invoice documents. Our dataset includes 630 invoice document PDFs with four different layouts collected from diverse suppliers. As far as we know, our invoice dataset is the only openly available dataset comprising high-quality, highly diverse, multi-layout, and annotated invoice documents.

Dataset: http://doi.org/10.5281/zenodo.5113009

Dataset License: CC-BY-4.0


Pages: 10

