On the Evaluation of Speech Foundation Models for Spoken Language Understanding

Cited by: 0
Authors
Arora, Siddhant [1 ]
Pasad, Ankita [2 ]
Chien, Chung-Ming [2 ]
Han, Jionghao [1 ]
Sharma, Roshan [1 ]
Jung, Jee-weon [1 ]
Dhamyal, Hira [1 ]
Chen, William [1 ]
Shon, Suwon [3 ]
Lee, Hung-yi [4 ]
Livescu, Karen [2 ]
Watanabe, Shinji [1 ]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] Toyota Technol Inst Chicago, Chicago, IL USA
[3] ASAPP, New York, NY USA
[4] Natl Taiwan Univ, Taipei, Taiwan
Funding
U.S. National Science Foundation
Keywords
RECOGNITION
DOI
Not available
Abstract
The Spoken Language Understanding Evaluation (SLUE) suite of benchmark tasks was recently introduced to address the need for open resources and benchmarking of complex spoken language understanding (SLU) tasks, including both classification and sequence generation tasks, on natural speech. The benchmark has demonstrated preliminary success in using pre-trained speech foundation models (SFMs) for these SLU tasks. However, the community still lacks a fine-grained understanding of the comparative utility of different SFMs. Motivated by this, we ask: which SFMs offer the most benefit for these complex SLU tasks, and what is the most effective approach for incorporating them? To answer this, we perform an extensive evaluation of multiple supervised and self-supervised SFMs using several evaluation protocols: (i) frozen SFMs with a lightweight prediction head, (ii) frozen SFMs with a complex prediction head, and (iii) fine-tuned SFMs with a lightweight prediction head. Although the supervised SFMs are pre-trained on far more labeled speech recognition data, they do not always outperform self-supervised SFMs; the latter tend to perform at least as well as, and sometimes better than, supervised SFMs, especially on the sequence generation tasks in SLUE. While there is no universally optimal way of incorporating SFMs, the complex prediction head gives the best performance for most tasks, although it increases inference time. We also introduce an open-source toolkit and performance leaderboard, SLUE-PERB, for these tasks and modeling strategies.
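The three evaluation protocols in the abstract differ only in which components are trained. A minimal sketch of that distinction, using hypothetical names (`protocol_config`, `sfm_trainable`, etc. are illustrative, not the actual SLUE-PERB toolkit API):

```python
# Hypothetical sketch: the three SLUE-PERB evaluation protocols expressed
# as which components are trainable. Not the toolkit's real API.

def protocol_config(name):
    """Map a protocol name to its trainable-vs-frozen component setup."""
    configs = {
        # (i) frozen SFM + lightweight prediction head (e.g. a linear layer):
        # only the small head is trained, probing representation quality.
        "frozen_light": {"sfm_trainable": False, "head": "light",
                         "head_trainable": True},
        # (ii) frozen SFM + complex prediction head (e.g. an encoder-decoder):
        # more head capacity, at the cost of slower inference.
        "frozen_complex": {"sfm_trainable": False, "head": "complex",
                           "head_trainable": True},
        # (iii) fine-tuned SFM + lightweight head: the SFM itself is updated.
        "finetune_light": {"sfm_trainable": True, "head": "light",
                           "head_trainable": True},
    }
    return configs[name]
```

Under this framing, protocols (i) and (ii) vary head capacity with the SFM fixed, while (iii) varies the SFM with the head fixed, which is what lets the paper separate the effect of the pre-trained representations from the effect of the downstream head.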
Pages: 11923 - 11938
Number of pages: 16