AsNER - Annotated Dataset and Baseline for Assamese Named Entity Recognition

Cited by: 0

Authors:

Pathak, Dhrubajyoti [1]

Nandi, Sukumar [1]

Sarmah, Priyankoo [1]

Affiliations:

[1] Indian Inst Technol Guwahati, North Guwahati, India

Source:

LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2022

Keywords:

NER dataset; Language Resources; Assamese NER; Assamese Language; Named Entity Recognition; NER model; AsNER;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Subject classification codes:

081203 ; 0835 ;

Abstract:

We present AsNER, a named entity annotation dataset for the low-resource Assamese language, together with a baseline Assamese NER model. The dataset contains about 99k tokens of text drawn from the speech of the Prime Minister of India and from Assamese play text. It also contains person names, location names, and addresses. The proposed NER dataset is likely to be a significant resource for deep neural network based Assamese language processing. We benchmark the dataset by training and evaluating NER models built on state-of-the-art architectures for supervised named entity recognition (NER), such as FastText, BERT, XLM-R, FLAIR, and MuRIL. We implement several baseline approaches with the state-of-the-art Bi-LSTM-CRF sequence tagging architecture. The best baseline achieves an F1-score of 80.69% when using MuRIL as the word embedding method. The annotated dataset and the top-performing model are made publicly available.

Pages: 6571-6577

Page count: 7
