Establishing vocabulary tests as a benchmark for evaluating large language models

Cited by: 0
Authors
Martinez, Gonzalo [1 ]
Conde, Javier [2 ]
Merino-Gomez, Elena [3 ]
Bermudez-Margaretto, Beatriz [4 ]
Hernandez, Jose Alberto [1 ]
Reviriego, Pedro [2 ]
Brysbaert, Marc [5 ]
Affiliations
[1] Univ Carlos III Madrid, Dept Ingn Telemat, Leganes, Spain
[2] Univ Politecn Madrid, ETSI Telecomunicac, Madrid, Spain
[3] Univ Valladolid, Escuela Ingn Ind, Valladolid, Spain
[4] Univ Salamanca, Dept Psicol Basica Psicobiol & Metodol Las CC Com, Salamanca, Spain
[5] Univ Ghent, Dept Expt Psychol, Ghent, Belgium
Source
PLOS ONE | 2024, Vol. 19, No. 12
Keywords
WORD RECOGNITION; ACQUISITION; LEXTALE
DOI
10.1371/journal.pone.0308259
Chinese Library Classification (CLC)
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Subject Classification Codes
07; 0710; 09
Abstract
Vocabulary tests, once a cornerstone of language modeling evaluation, have been largely overlooked in the current landscape of Large Language Models (LLMs) like Llama 2, Mistral, and GPT. While most LLM evaluation benchmarks focus on specific tasks or domain-specific knowledge, they often neglect the fundamental linguistic aspects of language understanding. In this paper, we advocate for the revival of vocabulary tests as a valuable tool for assessing LLM performance. We evaluate seven LLMs using two vocabulary test formats across two languages and uncover surprising gaps in their lexical knowledge. These findings shed light on the intricacies of LLM word representations, their learning mechanisms, and performance variations across models and languages. Moreover, the ability to automatically generate and perform vocabulary tests offers new opportunities to expand the approach and provide a more complete picture of LLMs' language skills.
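To make the test format concrete, below is a minimal sketch (not the authors' code) of how a LexTALE-style yes/no vocabulary test could be administered to a language model. The items, the prompt wording, and the `ask` callable are illustrative assumptions; only the scoring rule, which averages accuracy on words and on nonwords to penalize a yes-bias, follows the standard LexTALE procedure.

from typing import Callable, Iterable, Tuple

# (item, is_real_word) pairs: real English words plus pronounceable
# nonwords. These items are illustrative, not the paper's test set.
ITEMS = [
    ("kitchen", True),
    ("scornful", True),
    ("mensible", False),
    ("alberation", False),
]

# Hypothetical prompt template; the paper's exact wording may differ.
PROMPT = "Is '{w}' a real English word? Answer with only YES or NO."

def lextale_score(ask: Callable[[str], str],
                  items: Iterable[Tuple[str, bool]] = ITEMS) -> float:
    """Average of word accuracy and nonword accuracy (LexTALE-style)."""
    hits = {True: 0, False: 0}
    totals = {True: 0, False: 0}
    for item, is_real in items:
        reply = ask(PROMPT.format(w=item)).strip().upper()
        said_yes = reply.startswith("YES")
        totals[is_real] += 1
        hits[is_real] += int(said_yes == is_real)
    return (hits[True] / totals[True] + hits[False] / totals[False]) / 2

if __name__ == "__main__":
    # Stand-in for a real LLM client: a model that answers YES to
    # everything scores only 0.5 under the balanced scoring.
    yes_bot = lambda prompt: "YES"
    print(f"yes-bot score: {lextale_score(yes_bot):.2f}")

The same harness would extend to the abstract's second test format by swapping the prompt template and the answer parser.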
Pages: 17
Related Papers (showing 10 of 50)
  • [1] A bilingual benchmark for evaluating large language models
    Alkaoud, Mohamed
    PEERJ COMPUTER SCIENCE, 2024, 10
  • [2] MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models
    Cai, Yan
    Wang, Linlin
    Wang, Ye
    de Melo, Gerard
    Zhang, Ya
    Wang, Yanfeng
    He, Liang
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38, NO 16, 2024: 17709-17717
  • [3] DesignQA: A Multimodal Benchmark for Evaluating Large Language Models' Understanding of Engineering Documentation
    Doris, Anna C.
    Grandi, Daniele
    Tomich, Ryan
    Alam, Md Ferdous
    Ataei, Mohammadmehdi
    Cheong, Hyunmin
    Ahmed, Faez
    JOURNAL OF COMPUTING AND INFORMATION SCIENCE IN ENGINEERING, 2025, 25 (02)
  • [4] JavaBench: A Benchmark of Object-Oriented Code Generation for Evaluating Large Language Models
    Cao, Jialun
    Chen, Zhiyong
    Wu, Jiarong
    Cheung, Shing-Chi
    Xu, Chang
    PROCEEDINGS OF THE 39TH ACM/IEEE INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING (ASE 2024), 2024: 870-882
  • [5] Dr.Academy: A Benchmark for Evaluating Questioning Capability in Education for Large Language Models
    Chen, Yuyan
    Wu, Chenwei
    Yan, Songzhou
    Liu, Panjun
    Zhou, Haoyu
    Xiao, Yanghua
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024: 3138-3167
  • [6] PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change
    Valmeekam, Karthik
    Marquez, Matthew
    Olmo, Alberto
    Sreedharan, Sarath
    Kambhampati, Subbarao
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023
  • [7] Do LLMs Understand Social Knowledge? Evaluating the Sociability of Large Language Models with the SOCKET Benchmark
    Choi, Minje
    Pei, Jiaxin
    Kumar, Sagar
    David, Shua
    Jurgens, David
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023: 11370-11403
  • [8] OpenToM: A Comprehensive Benchmark for Evaluating Theory-of-Mind Reasoning Capabilities of Large Language Models
    Xu, Hainiu
    Zhao, Runcong
    Zhu, Lixing
    Du, Jinhua
    He, Yulan
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024: 8593-8623
  • [9] This is not a Dataset: A Large Negation Benchmark to Challenge Large Language Models
    Garcia-Ferrero, Iker
    Altuna, Begona
    Alvez, Javier
    Gonzalez-Dios, Itziar
    Rigau, German
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023: 8596-8615
  • [10] Large Vocabulary SOUL Neural Network Language Models
    Le, Hai-Son
    Oparin, Ilya
    Messaoudi, Abdel
    Allauzen, Alexandre
    Gauvain, Jean-Luc
    Yvon, Francois
    12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2011), VOLS 1-5, 2011: 1480+