MédicoBERT: A Medical Language Model for Spanish Natural Language Processing Tasks with a Question-Answering Application Using Hyperparameter Optimization

Authors
Cuevas, Josue Padilla [1]
Reyes-Ortiz, Jose A. [2]
Cuevas-Rasgado, Alma D. [1]
Mora-Gutierrez, Roman A. [2]
Bravo, Maricela [2]
Affiliations
[1] Univ Autonoma Estado Mexico CU, Comp Engn, Texcoco 56259, Mexico
[2] Autonomous Metropolitan Univ, Syst Dept, Mexico City 02200, Mexico
Source
APPLIED SCIENCES-BASEL | 2024 / Volume 14 / Issue 16
Keywords
LLM; BERT; pre-training model; question answering; fine-tuning; hyperparameter optimization; NLP benchmark; Spanish medical language modeling; MédicoBERT
DOI
10.3390/app14167031
Chinese Library Classification number
O6 [Chemistry]
Discipline classification code
0703
Abstract
The increasing volume of medical information available in digital format presents a significant challenge for researchers seeking to extract relevant information. Manually analyzing voluminous data is a time-consuming process that constrains researchers' productivity. In this context, innovative and intelligent computational approaches to information search, such as large language models (LLMs), offer a promising solution. LLMs understand natural language questions and respond accurately to complex queries, even in the specialized domain of medicine. This paper presents MédicoBERT, a medical language model in Spanish developed by adapting a general domain language model (BERT) to medical terminology and vocabulary related to diseases, treatments, symptoms, and medications. The model was pre-trained with 3 M medical texts containing 1.1 B words. Furthermore, with promising results, MédicoBERT was adapted and evaluated to answer medical questions in Spanish. The question-answering (QA) task was fine-tuned using a Spanish corpus of over 34,000 medical questions and answers. A search was then conducted to identify the optimal hyperparameter configuration using heuristic methods and nonlinear regression models. The evaluation of MédicoBERT was carried out using metrics such as perplexity to measure the adaptation of the language model to the medical vocabulary in Spanish, where it obtained a value of 4.28, and the average F1 metric for the task of answering medical questions, where it obtained a value of 62.35%. The objective of MédicoBERT is to provide support for research in the field of natural language processing (NLP) in Spanish, with a particular emphasis on applications within the medical domain.
Pages: 17