Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers

Cited by: 0
Authors
Geng, Shijie [1 ]
Gao, Peng [2 ]
Chatterjee, Moitreya [3 ]
Hori, Chiori [4 ]
Le Roux, Jonathan [4 ]
Zhang, Yongfeng [1 ]
Li, Hongsheng [2 ]
Cherian, Anoop [4 ]
Affiliations
[1] Rutgers State Univ, Piscataway, NJ 08854 USA
[2] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[3] Univ Illinois, Urbana, IL USA
[4] Mitsubishi Elect Res Labs MERL, Cambridge, MA USA
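Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract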
Given an input video, its associated audio, and a brief caption, the audio-visual scene aware dialog (AVSD) task requires an agent to engage in a question-answer dialog with a human about the audio-visual content. The task thus poses a challenging multi-modal representation learning and reasoning scenario, advances in which could influence several human-machine interaction applications. To solve this task, we introduce a semantics-controlled multi-modal shuffled Transformer reasoning framework, consisting of a sequence of Transformer modules, each taking a modality as input and producing representations conditioned on the input question. Our proposed Transformer variant applies a shuffling scheme to its multi-head outputs, which demonstrates better regularization. To encode fine-grained visual information, we present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing spatio-semantic graph representations for every frame, and an inter-frame aggregation module capturing temporal cues. Our entire pipeline is trained end-to-end. We present experiments on the benchmark AVSD dataset, on both answer generation and answer selection tasks. Our results demonstrate state-of-the-art performance on all evaluation metrics.
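The abstract mentions a shuffling scheme applied to the multi-head outputs of each Transformer module but gives no implementation details. The following is only a minimal sketch of one plausible reading: during training, the order of the attention heads is randomly permuted before their outputs are concatenated, and the order is left untouched at inference. The function name `shuffle_heads`, the nested-list data layout, and the training flag are all assumptions made for illustration and are not taken from the paper.

```python
import random


def shuffle_heads(head_outputs, rng, training=True):
    """Concatenate per-head outputs, optionally permuting the head order.

    head_outputs: list of heads; each head is a list of per-token vectors
                  (hypothetical layout: heads -> tokens -> features).
    rng:          a random.Random instance used to draw the permutation.
    training:     if False, heads are concatenated in their original order.
    """
    order = list(range(len(head_outputs)))
    if training:
        # Assumed regularizer: one random head permutation per call,
        # shared by every token in the sequence.
        rng.shuffle(order)
    seq_len = len(head_outputs[0])
    # For each token, concatenate that token's features from every head,
    # visiting the heads in the (possibly shuffled) order.
    return [
        [value for h in order for value in head_outputs[h][t]]
        for t in range(seq_len)
    ]


# Toy input: 4 heads, 3 tokens, 2 features per head.
heads = [[[h * 10 + t, h * 10 + t + 0.5] for t in range(3)] for h in range(4)]
mixed = shuffle_heads(heads, random.Random(0))
```

Each token's concatenated vector keeps the same values in either mode; only the position of each head's block changes, which is what would stop later layers from relying on a fixed head ordering under this assumed reading.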
Pages: 1415-1423
Page count: 9