MédicoBERT: A Medical Language Model for Spanish Natural Language Processing Tasks with a Question-Answering Application Using Hyperparameter Optimization

Cited by: 0
Authors
Cuevas, Josue Padilla [1 ]
Reyes-Ortiz, Jose A. [2 ]
Cuevas-Rasgado, Alma D. [1 ]
Mora-Gutierrez, Roman A. [2 ]
Bravo, Maricela [2 ]
Affiliations
[1] Univ Autonoma Estado Mexico CU, Comp Engn, Texcoco 56259, Mexico
[2] Autonomous Metropolitan Univ, Syst Dept, Mexico City 02200, Mexico
Source
APPLIED SCIENCES-BASEL | 2024 / Volume 14 / Issue 16
Keywords
LLM; BERT; pre-training model; question answering; fine-tuning; hyperparameter optimization; NLP benchmark; Spanish medical language modeling; MédicoBERT
DOI
10.3390/app14167031
Chinese Library Classification (CLC) number
O6 [Chemistry];
Discipline classification code
0703;
Abstract
The increasing volume of medical information available in digital format presents a significant challenge for researchers seeking to extract relevant information. Manually analyzing voluminous data is a time-consuming process that constrains researchers' productivity. In this context, innovative and intelligent computational approaches to information search, such as large language models (LLMs), offer a promising solution. LLMs understand natural language questions and respond accurately to complex queries, even in the specialized domain of medicine. This paper presents MédicoBERT, a medical language model in Spanish developed by adapting a general domain language model (BERT) to medical terminology and vocabulary related to diseases, treatments, symptoms, and medications. The model was pre-trained with 3 M medical texts containing 1.1 B words. Furthermore, with promising results, MédicoBERT was adapted and evaluated to answer medical questions in Spanish. The question-answering (QA) task was fine-tuned using a Spanish corpus of over 34,000 medical questions and answers. A search was then conducted to identify the optimal hyperparameter configuration using heuristic methods and nonlinear regression models. The evaluation of MédicoBERT was carried out using metrics such as perplexity to measure the adaptation of the language model to the medical vocabulary in Spanish, where it obtained a value of 4.28, and the average F1 metric for the task of answering medical questions, where it obtained a value of 62.35%. The objective of MédicoBERT is to provide support for research in the field of natural language processing (NLP) in Spanish, with a particular emphasis on applications within the medical domain.
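The abstract reports two metrics, perplexity and average F1, without stating how they were computed. The following is a minimal sketch of their usual definitions, assuming perplexity is the exponential of the mean negative log-likelihood per token and F1 is the SQuAD-style token-overlap score. The function names `perplexity` and `token_f1`, the whitespace tokenization, and the example strings are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter


def perplexity(token_log_probs):
    """Perplexity as exp of the mean negative natural-log probability per token.

    Assumption: the paper's exact formulation (e.g. masked-token scoring) is
    not given in the abstract; this is the standard definition.
    """
    if not token_log_probs:
        raise ValueError("at least one token log-probability is required")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def token_f1(prediction, reference):
    """SQuAD-style token-overlap F1 between a predicted and a reference answer.

    Assumption: lowercased whitespace tokenization; the paper may normalize
    answers differently.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def average_f1(pairs):
    """Mean token F1 over (prediction, reference) pairs."""
    return sum(token_f1(p, r) for p, r in pairs) / len(pairs)
```

Under these assumptions, a model that assigns probability 0.25 to every token has perplexity 4, so the reported value of 4.28 would correspond to an average per-token probability of roughly 1/4.28 in the geometric-mean sense.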
Pages: 17