Establishing vocabulary tests as a benchmark for evaluating large language models

Cited by: 0

Authors:

Martinez, Gonzalo [1]

Conde, Javier [2]

Merino-Gomez, Elena [3]

Bermudez-Margaretto, Beatriz [4]

Hernandez, Jose Alberto [1]

Reviriego, Pedro [2]

Brysbaert, Marc [5]

Affiliations:

[1] Univ Carlos III Madrid, Dept Ingn Telemat, Leganes, Spain

[2] Univ Politecn Madrid, ETSI Telecomunicac, Madrid, Spain

[3] Univ Valladolid, Escuela Ingn Ind, Valladolid, Spain

[4] Univ Salamanca, Dept Psicol Basica Psicobiol & Metodol Las CC Com, Salamanca, Spain

[5] Univ Ghent, Dept Expt Psychol, Ghent, Belgium

Source:

PLOS ONE | 2024 / Volume 19 / Issue 12

Keywords:

WORD RECOGNITION; ACQUISITION; LEXTALE;

DOI:

10.1371/journal.pone.0308259

Chinese Library Classification (CLC):

O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];

Discipline classification codes:

07 ; 0710 ; 09 ;

Abstract:

Vocabulary tests, once a cornerstone of language modeling evaluation, have been largely overlooked in the current landscape of Large Language Models (LLMs) like Llama 2, Mistral, and GPT. While most LLM evaluation benchmarks focus on specific tasks or domain-specific knowledge, they often neglect the fundamental linguistic aspects of language understanding. In this paper, we advocate for the revival of vocabulary tests as a valuable tool for assessing LLM performance. We evaluate seven LLMs using two vocabulary test formats across two languages and uncover surprising gaps in their lexical knowledge. These findings shed light on the intricacies of LLM word representations, their learning mechanisms, and performance variations across models and languages. Moreover, the ability to automatically generate and perform vocabulary tests offers new opportunities to expand the approach and provide a more complete picture of LLMs' language skills.


Pages: 17

50 items in total

[31] Evaluating large language models as agents in the clinic
Mehandru, Nikita
Miao, Brenda Y.
Almaraz, Eduardo Rodriguez
Sushil, Madhumita
Butte, Atul J.
Alaa, Ahmed
NPJ DIGITAL MEDICINE, 2024, 7 (01)
[32] HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models
Li, Junyi
Cheng, Xiaoxue
Zhao, Wayne Xin
Nie, Jian-Yun
Wen, Ji-Rong
2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 6449 - 6464
[33] Hybrid language models for out of vocabulary word detection in large vocabulary conversational speech recognition
Yazgan, A
Saraclar, M
2004 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS: SPEECH PROCESSING, 2004, : 745 - 748
[34] GRASP: A Novel Benchmark for Evaluating Language GRounding and Situated Physics Understanding in Multimodal Language Models
Jassimi, Serwan
Holubar, Mario
Richter, Annika
Wolff, Cornelius
Ohmer, Xenia
Bruni, Elia
PROCEEDINGS OF THE THIRTY-THIRD INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2024, 2024, : 6297 - 6305
[35] ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code
Feng, Jia
Liu, Jiachen
Gao, Cuiyun
Chong, Chun Yong
Wang, Chaozheng
Gao, Shan
Xia, Xin
arXiv, 2024,
[36] ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code
Feng, Jia
Liu, Jiachen
Gao, Cuiyun
Chong, Chun Yong
Wang, Chaozheng
Gao, Shan
Xia, Xin
Proceedings - 2024 39th ACM/IEEE International Conference on Automated Software Engineering, ASE 2024, : 1895 - 1906
[37] VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models
Zhou, Wangchunshu
Zeng, Yan
Diao, Shizhe
Zhang, Xinsong
INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022,
[38] Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models
Kim, Seungduk
Choi, Seungtaek
Jeong, Myeongho
arXiv,
[39] Transition movement models for large vocabulary continuous sign language recognition
Gao, W
Fang, GL
Zhao, DB
Chen, YQ
SIXTH IEEE INTERNATIONAL CONFERENCE ON AUTOMATIC FACE AND GESTURE RECOGNITION, PROCEEDINGS, 2004, : 553 - 558
[40] Towards a benchmark dataset for large language models in the context of process automation
Tizaoui, Tejennour
Tan, Ruomu
DIGITAL CHEMICAL ENGINEERING, 2024, 13
