Audio Representation Learning by Distilling Video as Privileged Information

Cited by: 2
Authors
Hajavi A. [1]
Etemad A. [1]
Affiliations
[1] Queen's University at Kingston, Department of Electrical and Computer Engineering, Kingston, ON K7L 3N6, Canada
Keywords
Audiovisual representation learning; deep learning; knowledge distillation; learning using privileged information (LUPI); multimodal data
DOI
10.1109/TAI.2023.3243596
Abstract
Deep audio representation learning using multimodal audiovisual data often leads to better performance than unimodal approaches. However, in real-world scenarios both modalities are not always available at inference time, which degrades the performance of models trained for multimodal inference. In this article, we propose a novel approach for deep audio representation learning using audiovisual data when the video modality is absent at inference. For this purpose, we adopt teacher-student knowledge distillation under the framework of learning using privileged information (LUPI). While previous methods proposed for LUPI use soft labels generated by the teacher, in our proposed method we use embeddings learned by the teacher to train the student network. We integrate our method in two different settings: sequential data, where the features are divided into multiple segments over time, and nonsequential data, where the entire features are treated as one whole segment. In the nonsequential setting, both the teacher and student networks are composed of an encoder component and a task header. We use the embeddings produced by the encoder component of the teacher to train the encoder of the student, while the task header of the student is trained using ground-truth labels. In the sequential setting, the networks have an additional aggregation component placed between the encoder and the task header. We use two sets of embeddings, produced by the encoder and the aggregation component of the teacher, to train the student. As in the nonsequential setting, the task header of the student network is trained using ground-truth labels. We test our framework on two different audiovisual tasks, namely speaker recognition and speech emotion recognition. Through these experiments, we show that by treating the video modality as privileged information for the main goal of audio representation learning, our method yields considerable improvements over audio-only recognition as well as prior works that use LUPI. © 2020 IEEE.
Pages: 446-456
Number of pages: 10
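
To make the embedding-level distillation described in the abstract concrete, below is a minimal PyTorch sketch of the nonsequential setting: the student's audio-only encoder is trained to match embeddings from a frozen audiovisual teacher, while the student's task header is trained on ground-truth labels. This is not the authors' implementation; the class names, encoder architecture, dimensions, the teacher interface (`encoder_embedding`), the MSE embedding loss, and the weight `alpha` are all illustrative assumptions.

```python
# Minimal sketch of LUPI-style embedding distillation (nonsequential setting).
# Assumed names and shapes; the original paper's architectures and losses may differ.
import torch
import torch.nn as nn


class Student(nn.Module):
    def __init__(self, n_mels=80, emb_dim=256, n_classes=100):
        super().__init__()
        # Audio-only encoder: per-frame features -> embedding space (assumed architecture).
        self.encoder = nn.Sequential(
            nn.Linear(n_mels, 512), nn.ReLU(),
            nn.Linear(512, emb_dim),
        )
        # Task header: embedding -> class logits (e.g., speaker or emotion classes).
        self.task_header = nn.Linear(emb_dim, n_classes)

    def forward(self, audio_feats):
        # audio_feats: (batch, time, n_mels); mean-pool over time for a single embedding.
        emb = self.encoder(audio_feats).mean(dim=1)
        return emb, self.task_header(emb)


def training_step(student, teacher, audio, video, labels, alpha=1.0):
    """One training step: the frozen audiovisual teacher's encoder embeddings act as
    privileged supervision for the student encoder; the task header uses true labels."""
    with torch.no_grad():  # the teacher is not updated
        teacher_emb = teacher.encoder_embedding(audio, video)  # assumed teacher API
    student_emb, logits = student(audio)
    distill_loss = nn.functional.mse_loss(student_emb, teacher_emb)  # embedding matching
    task_loss = nn.functional.cross_entropy(logits, labels)          # ground-truth labels
    return task_loss + alpha * distill_loss
```

In the sequential setting described in the abstract, the same idea would be applied twice, once to the segment-level encoder embeddings and once to the embeddings produced by the aggregation component, with the task header again supervised by ground-truth labels.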