Audio Representation Learning by Distilling Video as Privileged Information

Cited by: 2
Authors
Hajavi A. [1 ]
Etemad A. [1 ]
Affiliations
[1] Department of Electrical and Computer Engineering, Queen's University, Kingston, ON K7L 3N6, Canada
Source
IEEE Transactions on Artificial Intelligence
Keywords
Audiovisual representation learning; deep learning; knowledge distillation; learning using privileged information (LUPI); multimodal data
DOI
10.1109/TAI.2023.3243596
Abstract
Deep audio representation learning using multimodal audiovisual data often leads to better performance than unimodal approaches. However, in real-world scenarios, both modalities are not always available at inference time, which degrades the performance of models trained for multimodal inference. In this article, we propose a novel approach for deep audio representation learning using audiovisual data when the video modality is absent at inference. For this purpose, we adopt teacher-student knowledge distillation under the framework of learning using privileged information (LUPI). While previous methods proposed for LUPI use soft labels generated by the teacher, our method uses the embeddings learned by the teacher to train the student network. We integrate our method in two different settings: sequential data, where the features are divided into multiple segments over time, and nonsequential data, where the entire features are treated as one whole segment. In the nonsequential setting, both the teacher and student networks consist of an encoder component and a task header. We use the embeddings produced by the encoder component of the teacher to train the encoder of the student, while the task header of the student is trained using ground-truth labels. In the sequential setting, the networks have an additional aggregation component placed between the encoder and the task header. We use two sets of embeddings, produced by the encoder and the aggregation component of the teacher, to train the student. As in the nonsequential setting, the task header of the student network is trained using ground-truth labels. We test our framework on two different audiovisual tasks, namely speaker recognition and speech emotion recognition. Through these experiments, we show that by treating the video modality as privileged information for the main goal of audio representation learning, our method yields considerable improvements over audio-only recognition as well as prior works that use LUPI. © 2020 IEEE.
Pages: 446-456
Page count: 10
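
The abstract describes the distillation objective only at a high level. Below is a minimal PyTorch sketch of how the sequential setting could look, assuming mean-squared-error losses for matching the teacher's encoder and aggregation embeddings and a cross-entropy task loss on ground-truth labels. The module sizes, names (AudioStudent, student_loss, feat_dim, emb_dim, alpha, beta), and loss choices are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioStudent(nn.Module):
    # Audio-only student for the sequential setting:
    # encoder -> aggregation -> task header (all layer choices are assumptions).
    def __init__(self, feat_dim=40, emb_dim=256, n_classes=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, emb_dim), nn.ReLU(),
            nn.Linear(emb_dim, emb_dim))
        self.aggregator = nn.GRU(emb_dim, emb_dim, batch_first=True)
        self.task_header = nn.Linear(emb_dim, n_classes)

    def forward(self, x):            # x: (batch, time, feat_dim)
        enc = self.encoder(x)        # per-segment embeddings
        _, h = self.aggregator(enc)  # utterance-level embedding
        agg = h[-1]
        return enc, agg, self.task_header(agg)

def student_loss(enc_s, agg_s, logits, enc_t, agg_t, labels,
                 alpha=1.0, beta=1.0):
    # Teacher embeddings (enc_t, agg_t), extracted from the audiovisual
    # teacher, act as privileged targets for the student's encoder and
    # aggregation outputs; the task header is supervised only by
    # ground-truth labels, as stated in the abstract.
    distill = alpha * F.mse_loss(enc_s, enc_t) + beta * F.mse_loss(agg_s, agg_t)
    task = F.cross_entropy(logits, labels)
    return distill + task

In the nonsequential setting described in the abstract, the aggregation component would be dropped and only the encoder embedding matched to the teacher's (i.e., the agg terms removed), with the task header again trained on ground-truth labels.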