Audio Representation Learning by Distilling Video as Privileged Information

Cited by: 2
Authors
Hajavi A. [1 ]
Etemad A. [1 ]
Institutions
[1] Queen's University, Department of Electrical and Computer Engineering, Kingston, ON K7L 3N6, Canada
Keywords
Audiovisual representation learning; deep learning; knowledge distillation; learning using privileged information (LUPI); multimodal data
DOI: 10.1109/TAI.2023.3243596
Abstract
Deep audio representation learning using multimodal audiovisual data often leads to better performance than unimodal approaches. However, in real-world scenarios the two modalities are not always both available at inference time, which degrades the performance of models trained for multimodal inference. In this article, we propose a novel approach for deep audio representation learning from audiovisual data when the video modality is absent at inference. For this purpose, we adopt teacher-student knowledge distillation under the framework of learning using privileged information (LUPI). Whereas previous LUPI methods use soft labels generated by the teacher, our method uses embeddings learned by the teacher to train the student network. We integrate our method in two different settings: sequential data, where the features are divided into multiple segments over time, and nonsequential data, where the entire feature set is treated as one segment. In the nonsequential setting, both the teacher and student networks consist of an encoder and a task header. We use the embeddings produced by the teacher's encoder to train the student's encoder, while the student's task header is trained using ground-truth labels. In the sequential setting, the networks have an additional aggregation component placed between the encoder and the task header. We use two sets of embeddings, produced by the teacher's encoder and aggregation component, to train the student. As in the nonsequential setting, the student's task header is trained using ground-truth labels. We test our framework on two audiovisual tasks, namely speaker recognition and speech emotion recognition. Through these experiments, we show that treating the video modality as privileged information for audio representation learning yields considerable improvements over audio-only recognition as well as prior LUPI-based methods. © 2020 IEEE.
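The student objective sketched in the abstract — match the teacher's embeddings while training the task header on ground-truth labels — can be illustrated roughly as follows. This is a minimal reconstruction, not the authors' code: the array sizes, the use of an MSE distillation term, and the weighting factor `alpha` are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mse(a, b):
    """Embedding-matching (distillation) term: mean squared error."""
    return float(np.mean((a - b) ** 2))

def cross_entropy(logits, label):
    """Supervised term for the task header: softmax cross-entropy, one example."""
    z = logits - logits.max()                    # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return float(-log_probs[label])

# Hypothetical shapes: 8-d embeddings, 4-class task header.
teacher_embedding = rng.normal(size=8)   # from the (frozen) audiovisual teacher encoder
student_embedding = rng.normal(size=8)   # from the audio-only student encoder
student_logits = rng.normal(size=4)      # output of the student's task header
label = 2                                # ground-truth class

# Combined student loss: distill the teacher's embedding into the student's
# encoder, and supervise the task header with the ground-truth label.
alpha = 0.5  # hypothetical trade-off weight
loss = (alpha * mse(student_embedding, teacher_embedding)
        + (1 - alpha) * cross_entropy(student_logits, label))
```

In the sequential setting the abstract describes, the same idea would apply twice, with a second embedding-matching term between the teacher's and student's aggregation outputs.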
Pages: 446-456
Page count: 10
Related Papers
50 records in total
  • [31] Extracting Privileged Information for Enhancing Classifier Learning
    Yao, Yazhou
    Shen, Fumin
    Zhang, Jian
    Liu, Li
    Tang, Zhenmin
    Shao, Ling
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2019, 28 (01) : 436 - 450
  • [32] Information Bottleneck Learning Using Privileged Information for Visual Recognition
    Motiian, Saeid
    Piccirilli, Marco
    Adjeroh, Donald A.
    Doretto, Gianfranco
    2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016, : 1496 - 1505
  • [33] SEEING AND HEARING TOO: AUDIO REPRESENTATION FOR VIDEO CAPTIONING
    Chuang, Shun-Po
    Wan, Chia-Hung
    Huang, Pang-Chi
    Yang, Chi-Yu
    Lee, Hung-Yi
    2017 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2017, : 381 - 388
  • [34] Modern coding algorithms of video and audio information
    Popovich, P
    Shvaichenko, V
    MODERN PROBLEMS OF RADIO ENGINEERING, TELECOMMUNICATIONS AND COMPUTER SCIENCE, PROCEEDINGS, 2004, : 181 - 184
  • [35] Surveillance Robot Utilizing Video and Audio Information
    Wu, Xinyu
    Gong, Haitao
    Chen, Pei
    Zhong, Zhi
    Xu, Yangsheng
    Journal of Intelligent and Robotic Systems, 2009, 55 : 403 - 421
  • [36] REMOTE ACCESS AUDIO/VIDEO INFORMATION SYSTEM
    CROSSMAN, DM
    LIBRARY TRENDS, 1971, 19 (04) : 437 - &
  • [37] Surveillance Robot Utilizing Video and Audio Information
    Wu, Xinyu
    Gong, Haitao
    Chen, Pei
    Zhong, Zhi
    Xu, Yangsheng
    JOURNAL OF INTELLIGENT & ROBOTIC SYSTEMS, 2009, 55 (4-5) : 403 - 421
  • [38] Multivariate mutual information for audio video fusion
    Dilpazir, Hammad
    Muhammad, Zia
    Minhas, Qurratulain
    Ahmed, Faheem
    Malik, Hafiz
    Mahmood, Hasan
    SIGNAL IMAGE AND VIDEO PROCESSING, 2016, 10 (07) : 1265 - 1272
  • [39] Distilling a Hierarchical Policy for Planning and Control via Representation and Reinforcement Learning
    Ha, Jung-Su
    Park, Young-Jin
    Chae, Hyeok-Joo
    Park, Soon-Seo
    Choi, Han-Lim
    2021 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA 2021), 2021, : 4459 - 4466
  • [40] Multivariate mutual information for audio video fusion
    Dilpazir, Hammad
    Muhammad, Zia
    Minhas, Qurratulain
    Ahmed, Faheem
    Malik, Hafiz
    Mahmood, Hasan
    Signal, Image and Video Processing, 2016, 10 : 1265 - 1272