Establishing vocabulary tests as a benchmark for evaluating large language models

Cited by: 0
|
Authors
Martinez, Gonzalo [1 ]
Conde, Javier [2 ]
Merino-Gomez, Elena [3 ]
Bermudez-Margaretto, Beatriz [4 ]
Hernandez, Jose Alberto [1 ]
Reviriego, Pedro [2 ]
Brysbaert, Marc [5 ]
Affiliations
[1] Univ Carlos III Madrid, Dept Ingn Telemat, Leganes, Spain
[2] Univ Politecn Madrid, ETSI Telecomunicac, Madrid, Spain
[3] Univ Valladolid, Escuela Ingn Ind, Valladolid, Spain
[4] Univ Salamanca, Dept Psicol Basica Psicobiol & Metodol Las CC Com, Salamanca, Spain
[5] Univ Ghent, Dept Expt Psychol, Ghent, Belgium
Source
PLOS ONE | 2024 / Vol. 19 / Issue 12
Keywords
WORD RECOGNITION; ACQUISITION; LEXTALE
DOI
10.1371/journal.pone.0308259
CLC Numbers
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Discipline Codes
07; 0710; 09
Abstract
Vocabulary tests, once a cornerstone of language modeling evaluation, have been largely overlooked in the current landscape of Large Language Models (LLMs) like Llama 2, Mistral, and GPT. While most LLM evaluation benchmarks focus on specific tasks or domain-specific knowledge, they often neglect the fundamental linguistic aspects of language understanding. In this paper, we advocate for the revival of vocabulary tests as a valuable tool for assessing LLM performance. We evaluate seven LLMs using two vocabulary test formats across two languages and uncover surprising gaps in their lexical knowledge. These findings shed light on the intricacies of LLM word representations, their learning mechanisms, and performance variations across models and languages. Moreover, the ability to automatically generate and perform vocabulary tests offers new opportunities to expand the approach and provide a more complete picture of LLMs' language skills.
Pages: 17
Related Papers
50 records total
  • [31] Evaluating large language models as agents in the clinic
    Mehandru, Nikita
    Miao, Brenda Y.
    Almaraz, Eduardo Rodriguez
    Sushil, Madhumita
    Butte, Atul J.
    Alaa, Ahmed
    NPJ DIGITAL MEDICINE, 2024, 7 (01)
  • [32] HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models
    Li, Junyi
    Cheng, Xiaoxue
    Zhao, Wayne Xin
    Nie, Jian-Yun
    Wen, Ji-Rong
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 6449 - 6464
  • [33] Hybrid language models for out of vocabulary word detection in large vocabulary conversational speech recognition
    Yazgan, A
    Saraclar, M
    2004 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS: SPEECH PROCESSING, 2004, : 745 - 748
  • [34] GRASP: A Novel Benchmark for Evaluating Language GRounding and Situated Physics Understanding in Multimodal Language Models
    Jassim, Serwan
    Holubar, Mario
    Richter, Annika
    Wolff, Cornelius
    Ohmer, Xenia
    Bruni, Elia
    PROCEEDINGS OF THE THIRTY-THIRD INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2024, 2024, : 6297 - 6305
  • [35] ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code
    Feng, Jia
    Liu, Jiachen
    Gao, Cuiyun
    Chong, Chun Yong
    Wang, Chaozheng
    Gao, Shan
    Xia, Xin
    arXiv, 2024
  • [36] ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code
    Feng, Jia
    Liu, Jiachen
    Gao, Cuiyun
    Chong, Chun Yong
    Wang, Chaozheng
    Gao, Shan
    Xia, Xin
    Proceedings - 2024 39th ACM/IEEE International Conference on Automated Software Engineering, ASE 2024, : 1895 - 1906
  • [37] VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models
    Zhou, Wangchunshu
    Zeng, Yan
    Diao, Shizhe
    Zhang, Xinsong
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022,
  • [38] Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models
    Kim, Seungduk
    Choi, Seungtaek
    Jeong, Myeongho
    arXiv
  • [39] Transition movement models for large vocabulary continuous sign language recognition
    Gao, W
    Fang, GL
    Zhao, DB
    Chen, YQ
    SIXTH IEEE INTERNATIONAL CONFERENCE ON AUTOMATIC FACE AND GESTURE RECOGNITION, PROCEEDINGS, 2004, : 553 - 558
  • [40] Towards a benchmark dataset for large language models in the context of process automation
    Tizaoui, Tejennour
    Tan, Ruomu
    DIGITAL CHEMICAL ENGINEERING, 2024, 13