50 items in total
- [2] MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 16, 2024: 17709-17717
- [4] JavaBench: A Benchmark of Object-Oriented Code Generation for Evaluating Large Language Models. Proceedings - 2024 39th ACM/IEEE International Conference on Automated Software Engineering, ASE 2024: 870-882
- [5] Dr.Academy: A Benchmark for Evaluating Questioning Capability in Education for Large Language Models. PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024: 3138-3167
- [6] PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023
- [7] Do LLMs Understand Social Knowledge? Evaluating the Sociability of Large Language Models with the SOCKET Benchmark. 2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023: 11370-11403
- [8] OpenToM: A Comprehensive Benchmark for Evaluating Theory-of-Mind Reasoning Capabilities of Large Language Models. PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024: 8593-8623
- [9] This is not a Dataset: A Large Negation Benchmark to Challenge Large Language Models. 2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023: 8596-8615
- [10] Large Vocabulary SOUL Neural Network Language Models. 12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5, 2011: 1480-+