AsNER - Annotated Dataset and Baseline for Assamese Named Entity Recognition

Authors
Pathak, Dhrubajyoti [1]
Nandi, Sukumar [1]
Sarmah, Priyankoo [1]
Affiliations
[1] Indian Institute of Technology Guwahati, North Guwahati, India
Keywords
NER dataset; Language Resources; Assamese NER; Assamese Language; Named Entity Recognition; NER model; AsNER
DOI
Not available
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
We present AsNER, a named entity annotation dataset for the low-resource Assamese language, together with a baseline Assamese named entity recognition (NER) model. The dataset contains about 99k tokens of text drawn from the speech of the Prime Minister of India and an Assamese play. It also contains person names, location names, and addresses. The proposed NER dataset is likely to be a significant resource for deep neural network based Assamese language processing. We benchmark the dataset by training and evaluating NER models built on state-of-the-art embeddings and architectures for supervised NER, such as FastText, BERT, XLM-R, FLAIR, and MuRIL. We implement several baseline approaches with the state-of-the-art Bi-LSTM-CRF sequence tagging architecture. The best baseline achieves an F1-score of 80.69% when using MuRIL as the word embedding method. The annotated dataset and the top-performing model are made publicly available.
Pages: 6571-6577
Number of pages: 7