On the Evaluation of Speech Foundation Models for Spoken Language Understanding

Authors
Arora, Siddhant [1]
Pasad, Ankita [2]
Chien, Chung-Ming [2]
Han, Jionghao [1]
Sharma, Roshan [1]
Jung, Jee-weon [1]
Dhamyal, Hira [1]
Chen, William [1]
Shon, Suwon [3]
Lee, Hung-yi [4]
Livescu, Karen [2]
Watanabe, Shinji [1]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] Toyota Technol Inst Chicago, Chicago, IL USA
[3] ASAPP, New York, NY USA
[4] Natl Taiwan Univ, Taipei, Taiwan
Funding
US National Science Foundation
Keywords
RECOGNITION;
DOI
Not available
Abstract
The Spoken Language Understanding Evaluation (SLUE) suite of benchmark tasks was recently introduced to address the need for open resources and benchmarking of complex spoken language understanding (SLU) tasks, including both classification and sequence generation tasks, on natural speech. The benchmark has demonstrated preliminary success in using pretrained speech foundation models (SFMs) for these SLU tasks. However, the community still lacks a fine-grained understanding of the comparative utility of different SFMs. Motivated by this gap, we ask: which SFMs offer the most benefits for these complex SLU tasks, and what is the most effective approach for incorporating these SFMs? To answer this, we perform an extensive evaluation of multiple supervised and self-supervised SFMs using several evaluation protocols: (i) frozen SFMs with a lightweight prediction head, (ii) frozen SFMs with a complex prediction head, and (iii) fine-tuned SFMs with a lightweight prediction head. Although the supervised SFMs are pretrained on much more speech recognition data (with labels), they do not always outperform self-supervised SFMs; the latter tend to perform at least as well as, and sometimes better than, supervised SFMs, especially on the sequence generation tasks in SLUE. While there is no universally optimal way of incorporating SFMs, the complex prediction head gives the best performance for most tasks, although it increases the inference time. We also introduce an open-source toolkit and performance leaderboard, SLUE-PERB, for these tasks and modeling strategies.
Pages: 11923-11938
Page count: 16