On the Evaluation of Speech Foundation Models for Spoken Language Understanding

Authors
Arora, Siddhant [1]
Pasad, Ankita [2]
Chien, Chung-Ming [2]
Han, Jionghao [1]
Sharma, Roshan [1]
Jung, Jee-weon [1]
Dhamyal, Hira [1]
Chen, William [1]
Shon, Suwon [3]
Lee, Hung-yi [4]
Livescu, Karen [2]
Watanabe, Shinji [1]
Affiliations
[1] Carnegie Mellon University, Pittsburgh, PA 15213, USA
[2] Toyota Technological Institute at Chicago, Chicago, IL, USA
[3] ASAPP, New York, NY, USA
[4] National Taiwan University, Taipei, Taiwan
Funding
U.S. National Science Foundation
Keywords
RECOGNITION
DOI
Not available
Abstract
The Spoken Language Understanding Evaluation (SLUE) suite of benchmark tasks was recently introduced to address the need for open resources and benchmarking of complex spoken language understanding (SLU) tasks, including both classification and sequence generation tasks, on natural speech. The benchmark has demonstrated preliminary success in using pre-trained speech foundation models (SFMs) for these SLU tasks. However, the community still lacks a fine-grained understanding of the comparative utility of different SFMs. Motivated by this, we ask: which SFMs offer the most benefits for these complex SLU tasks, and what is the most effective approach for incorporating them? To answer this, we perform an extensive evaluation of multiple supervised and self-supervised SFMs under several evaluation protocols: (i) frozen SFMs with a lightweight prediction head, (ii) frozen SFMs with a complex prediction head, and (iii) fine-tuned SFMs with a lightweight prediction head. Although the supervised SFMs are pre-trained on much more (labeled) speech recognition data, they do not always outperform self-supervised SFMs; the latter tend to perform at least as well as, and sometimes better than, supervised SFMs, especially on the sequence generation tasks in SLUE. While there is no universally optimal way of incorporating SFMs, the complex prediction head gives the best performance for most tasks, although it increases inference time. We also introduce an open-source toolkit and performance leaderboard, SLUE-PERB, for these tasks and modeling strategies.
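The three evaluation protocols differ only in which parameters are trained: the prediction head alone (i, ii) or the SFM together with the head (iii). The minimal PyTorch sketch below illustrates this distinction for a classification task; it is not the SLUE-PERB implementation, and ToySFM, build_model, and all dimensions are hypothetical stand-ins (a real setup would load a pre-trained checkpoint such as HuBERT or Whisper).

# Minimal, illustrative sketch of the three evaluation protocols.
# ToySFM is a hypothetical stand-in for a pre-trained speech foundation model.
import torch
import torch.nn as nn

class ToySFM(nn.Module):
    """Stand-in encoder; in practice, load a pre-trained SFM checkpoint."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(80, dim)  # e.g., 80-dim log-Mel frames -> hidden
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, feats):
        return self.encoder(self.proj(feats))  # (batch, frames, dim)

def build_model(protocol, num_classes=10, dim=256):
    sfm = ToySFM(dim)
    if protocol.startswith("frozen"):
        sfm.requires_grad_(False)  # protocols (i) and (ii): SFM stays fixed
    if protocol == "frozen+complex":
        # Protocol (ii): a deeper (here, Transformer) prediction head.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        head = nn.Sequential(nn.TransformerEncoder(layer, num_layers=2),
                             nn.Linear(dim, num_classes))
    else:
        # Protocols (i) and (iii): a lightweight linear prediction head.
        head = nn.Linear(dim, num_classes)
    return sfm, head

# Usage: utterance-level classification (e.g., SLUE sentiment analysis).
sfm, head = build_model("frozen+light")
feats = torch.randn(4, 100, 80)         # (batch, frames, feature_dim)
logits = head(sfm(feats)).mean(dim=1)   # frame-level logits pooled over time
print(logits.shape)                     # torch.Size([4, 10])

Under protocol (iii) the optimizer would simply update both sfm and head parameters; sequence generation tasks would replace the pooled classifier with a decoder head. The abstract's finding that the complex head performs best on most tasks corresponds to the "frozen+complex" branch, traded against its added inference cost.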
Pages
11923-11938 (16 pages)