Establishing vocabulary tests as a benchmark for evaluating large language models

Cited by: 0

Authors:

Martinez, Gonzalo [1]

Conde, Javier [2]

Merino-Gomez, Elena [3]

Bermudez-Margaretto, Beatriz [4]

Hernandez, Jose Alberto [1]

Reviriego, Pedro [2]

Brysbaert, Marc [5]

Affiliations:

[1] Univ Carlos III Madrid, Dept Ingn Telemat, Leganes, Spain

[2] Univ Politecn Madrid, ETSI Telecomunicac, Madrid, Spain

[3] Univ Valladolid, Escuela Ingn Ind, Valladolid, Spain

[4] Univ Salamanca, Dept Psicol Basica Psicobiol & Metodol Las CC Com, Salamanca, Spain

[5] Univ Ghent, Dept Expt Psychol, Ghent, Belgium

Source:

PLOS ONE | 2024 / Volume 19 / Issue 12

Keywords:

WORD RECOGNITION; ACQUISITION; LEXTALE

DOI:

10.1371/journal.pone.0308259

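Chinese Library Classification (CLC):

O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]

Discipline classification codes:

07; 0710; 09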

Abstract:

Vocabulary tests, once a cornerstone of language modeling evaluation, have been largely overlooked in the current landscape of Large Language Models (LLMs) like Llama 2, Mistral, and GPT. While most LLM evaluation benchmarks focus on specific tasks or domain-specific knowledge, they often neglect the fundamental linguistic aspects of language understanding. In this paper, we advocate for the revival of vocabulary tests as a valuable tool for assessing LLM performance. We evaluate seven LLMs using two vocabulary test formats across two languages and uncover surprising gaps in their lexical knowledge. These findings shed light on the intricacies of LLM word representations, their learning mechanisms, and performance variations across models and languages. Moreover, the ability to automatically generate and perform vocabulary tests offers new opportunities to expand the approach and provide a more complete picture of LLMs' language skills.


Pages: 17
