New Human-Annotated Dataset of Czech Health Records for Training Medical Concept Recognition Models

Authors
Anetta, Kristof [1]
Horak, Ales [1]
Affiliations
[1] Masaryk Univ, Nat Language Proc Ctr, Fac Informat, Brno, Czech Republic
Source
TEXT, SPEECH, AND DIALOGUE, TSD 2024, PT I | 2024, Vol. 15048
Keywords
medical text analysis; electronic health records; medical concept terms; medical concept dataset; named entity recognition;
DOI
10.1007/978-3-031-70563-2_9
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Following the widespread success of recent large language models (LLMs) in various NLP tasks, this paper focuses on medical text content understanding. Adapting a foundational LLM to the medical domain requires a special kind of dataset in which core medical concepts are accurately annotated. This paper addresses the need for better medical concept recognition in free-text electronic health records in low-resourced Slavic languages and introduces CSEHR, a new human-annotated dataset of Czech oncology health records. It describes the dataset's inception, management, considerations, and processing, and finally presents baseline concept recognition model results. XLM-RoBERTa models trained on the dataset using 5-fold cross-validation achieved an average weighted F1 score of 0.672 in exact and 0.777 in partial medical concept recognition, ranging from 0.335 to 0.857 across concept classes. The paper then describes future plans for bootstrapping larger annotated corpora from the CSEHR dataset and for making the dataset publicly available. This endeavor is unique in the realm of Slavic languages, and already at this stage it represents a major step in the field of Slavic medical concept recognition.
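The distinction between exact and partial recognition, and the support-weighted averaging over concept classes, can be illustrated with a minimal sketch. The abstract does not specify how the paper defines a partial match or computes its scores, so everything below is an assumption for illustration only: spans are hypothetical `(start, end, label)` tuples with an exclusive end, a partial match is taken to be any overlap between spans of the same label, and the labels `DISEASE` and `DRUG` are invented and not taken from CSEHR.

```python
# Illustrative sketch only: the matching rules here are assumptions,
# not the evaluation procedure used in the paper.

def spans_match(gold, pred, partial):
    """Return True if a predicted span counts as a hit for a gold span."""
    if gold[2] != pred[2]:          # labels must agree in both modes
        return False
    if partial:                     # assumed rule: any character overlap
        return gold[0] < pred[1] and pred[0] < gold[1]
    return gold[:2] == pred[:2]     # exact: identical boundaries


def weighted_f1(gold, pred, partial=False):
    """Per-class F1 averaged with weights equal to the gold span counts."""
    labels = {s[2] for s in gold} | {s[2] for s in pred}
    total_support = len(gold)
    score = 0.0
    for label in labels:
        g = [s for s in gold if s[2] == label]
        p = [s for s in pred if s[2] == label]
        # Recall: share of gold spans hit by at least one prediction.
        recall = sum(any(spans_match(x, y, partial) for y in p) for x in g) / len(g) if g else 0.0
        # Precision: share of predictions that hit at least one gold span.
        precision = sum(any(spans_match(x, y, partial) for x in g) for y in p) / len(p) if p else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += len(g) * f1        # weight each class by its support
    return score / total_support if total_support else 0.0


# Hypothetical annotations for a single record.
gold = [(0, 2, "DISEASE"), (5, 8, "DRUG"), (10, 12, "DISEASE")]
pred = [(0, 2, "DISEASE"), (5, 7, "DRUG"), (10, 13, "DISEASE")]

exact_score = weighted_f1(gold, pred)
partial_score = weighted_f1(gold, pred, partial=True)
```

In this toy case two of the three predictions have slightly wrong boundaries, so they count only under the partial rule. That is why a partial score is never lower than the exact score on the same predictions, consistent with the 0.777 versus 0.672 figures reported in the abstract.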
Pages: 110-120
Number of pages: 11