COSMIC: Data Efficient Instruction-tuning For Speech In-Context Learning

Cited by: 0
Authors
Pan, Jing [1 ]
Wu, Jian [1 ]
Gaur, Yashesh [1 ]
Sivasankaran, Sunit [1 ]
Chen, Zhuo [1 ]
Liu, Shujie [1 ]
Li, Jinyu [1 ]
Affiliations
[1] Microsoft, One Microsoft Way, Redmond, WA 98052 USA
Source: INTERSPEECH 2024
Keywords
multi-modality; large language model; speech in-context learning; instruction tuning
DOI
10.21437/Interspeech.2024-1346
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
We present a cost-effective method to integrate speech into a large language model (LLM), resulting in COSMIC, a multi-modal Contextual Speech Model with Instruction-following and In-Context-learning Capabilities. Using GPT-3.5, we generate Speech Comprehension Test Question-Answer (SQA) pairs from speech transcriptions for supervised instruction tuning. With under 30 million trainable parameters and only 450 hours of English speech data, COSMIC demonstrates emerging instruction-following and in-context-learning capabilities. Equipped with these capabilities, COSMIC achieves a maximum 33.18 BLEU score in 0-shot EN-to-X speech-to-text translation (S2TT) and a significant boost in the 1-shot setting, along with an average 25.8% relative word error rate (WER) reduction in 1-shot cross-domain adaptation. COSMIC also shows a significant automatic speech recognition (ASR) accuracy gain on contextual biasing tasks, owing to its instruction-following capability.
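The data-generation step described in the abstract (prompting GPT-3.5 to turn speech transcriptions into comprehension question-answer pairs) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' released pipeline: the prompt wording, the `generate_sqa_pairs` helper, and the use of the `openai` Python client with `gpt-3.5-turbo` are assumptions, since the record only states that GPT-3.5 produced SQA pairs from transcriptions.

```python
# Minimal sketch of the SQA data-generation step described in the abstract.
# Everything here (prompt wording, helper name, client usage) is an
# illustrative assumption, not the authors' released pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def generate_sqa_pairs(transcription: str, num_pairs: int = 3) -> str:
    """Ask GPT-3.5 for comprehension question-answer pairs grounded in a transcript."""
    prompt = (
        f"Read the following speech transcription and write {num_pairs} "
        "question-answer pairs that test comprehension of its content.\n\n"
        f"Transcription: {transcription}\n\n"
        "Format each pair as 'Q: ...' followed by 'A: ...'."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content


# Example: one transcript; the abstract implies this was run over the
# transcriptions of roughly 450 hours of English speech.
if __name__ == "__main__":
    print(generate_sqa_pairs(
        "The committee approved the budget after a two-hour debate."
    ))
```

Presumably, each generated pair is then coupled with the corresponding audio, so that supervised instruction tuning teaches the model to answer questions about speech rather than about text.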
Pages: 4164-4168
Page count: 5