Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers

Cited by: 0

Authors:

Geng, Shijie [1]
Gao, Peng [2]
Chatterjee, Moitreya [3]
Hori, Chiori [4]
Le Roux, Jonathan [4]
Zhang, Yongfeng [1]
Li, Hongsheng [2]
Cherian, Anoop [4]

Affiliations:

[1] Rutgers State Univ, Piscataway, NJ 08854 USA

[2] Chinese Univ Hong Kong, Hong Kong, Peoples R China

[3] Univ Illinois, Urbana, IL USA

[4] Mitsubishi Elect Res Labs MERL, Cambridge, MA USA

Source:

THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE | 2021 / Vol. 35

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Given an input video, its associated audio, and a brief caption, the audio-visual scene aware dialog (AVSD) task requires an agent to engage in a question-answer dialog with a human about the audio-visual content. This task thus poses a challenging multi-modal representation learning and reasoning scenario, advances in which could influence several human-machine interaction applications. To solve this task, we introduce a semantics-controlled multi-modal shuffled Transformer reasoning framework, consisting of a sequence of Transformer modules, each taking a modality as input and producing representations conditioned on the input question. Our proposed Transformer variant uses a shuffling scheme on its multi-head outputs, demonstrating better regularization. To encode fine-grained visual information, we present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing spatio-semantic graph representations for every frame, and an inter-frame aggregation module capturing temporal cues. Our entire pipeline is trained end-to-end. We present experiments on the benchmark AVSD dataset, both on answer generation and selection tasks. Our results demonstrate state-of-the-art performance on all evaluation metrics.
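
The abstract mentions a shuffling scheme applied to the multi-head outputs of each Transformer module, but this record does not say how the shuffle is performed. The sketch below is one plausible reading, not the authors' implementation: it assumes the order of the heads is randomly permuted before their outputs are concatenated, and that this happens only during training. The function name `concat_heads`, its parameters, and the toy data are all hypothetical.

```python
import random


def concat_heads(head_outputs, rng=None, training=True):
    """Concatenate per-head outputs, optionally permuting the head order.

    head_outputs[h][t] is the feature vector (a list of floats) that
    head h produced for sequence position t.
    Assumption: shuffling means a random permutation of the head order,
    applied only when training is True and an rng is supplied.
    """
    order = list(range(len(head_outputs)))
    if training and rng is not None:
        rng.shuffle(order)  # in-place random permutation of head indices

    seq_len = len(head_outputs[0])
    merged = []
    for t in range(seq_len):
        row = []
        for h in order:
            row.extend(head_outputs[h][t])  # append head h's features
        merged.append(row)
    return merged, order


# Toy input: 3 heads, 2 sequence positions, 2 features per head.
heads = [
    [[1.0, 1.1], [1.2, 1.3]],
    [[2.0, 2.1], [2.2, 2.3]],
    [[3.0, 3.1], [3.2, 3.3]],
]

eval_out, eval_order = concat_heads(heads, training=False)
train_out, train_order = concat_heads(heads, rng=random.Random(0), training=True)
```

Under this reading, the layer that consumes the concatenated vector cannot rely on a fixed head position during training, which is one way such a scheme could act as a regularizer. At evaluation time the heads keep their original order.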

Pages: 1415-1423

Page count: 9

50 items in total

[31] Multi-modal Representation Learning for Successive POI Recommendation
Li, Lishan
Liu, Ying
Wu, Jianping
He, Lin
Ren, Gang
ASIAN CONFERENCE ON MACHINE LEARNING, VOL 101, 2019, 101 : 441 - 456
[32] Joint Representation Learning for Multi-Modal Transportation Recommendation
Liu, Hao
Li, Ting
Hu, Renjun
Fu, Yanjie
Gu, Jingjing
Xiong, Hui
THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 1036 - 1043
[33] Deep contrastive representation learning for multi-modal clustering
Lu, Yang
Li, Qin
Zhang, Xiangdong
Gao, Quanxue
NEUROCOMPUTING, 2024, 581
[34] Supervised Multi-modal Dictionary Learning for Clothing Representation
Zhao, Qilu
Wang, Jiayan
Li, Zongmin
PROCEEDINGS OF THE FIFTEENTH IAPR INTERNATIONAL CONFERENCE ON MACHINE VISION APPLICATIONS - MVA2017, 2017, : 51 - 54
[35] Enhanced Topic Modeling with Multi-modal Representation Learning
Zhang, Duoyi
Wang, Yue
Abul Bashar, Md
Nayak, Richi
ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PAKDD 2023, PT I, 2023, 13935 : 393 - 404
[36] Editorial for Special Issue on Multi-modal Representation Learning
Fan, Deng-Ping
Barnes, Nick
Cheng, Ming-Ming
Van Gool, Luc
MACHINE INTELLIGENCE RESEARCH, 2024, 21 (04) : 615 - 616
[37] Multi-Modal Knowledge Representation Learning via Webly-Supervised Relationships Mining
Nian, Fudong
Bao, Bing-Kun
Li, Teng
Xu, Changsheng
PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 411 - 419
[38] Dynamic Tracking of State Anxiety via Multi-Modal Data and Machine Learning
Ding, Yue
Liu, Jingjing
Zhang, Xiaochen
Yang, Zhi
FRONTIERS IN PSYCHIATRY, 2022, 13
[39] MGDR: Multi-modal Graph Disentangled Representation for Brain Disease Prediction
Jiang, Bo
Li, Yapeng
Wan, Xixi
Chen, Yuan
Tu, Zhengzheng
Zhao, Yumiao
Tang, Jin
MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2024, PT II, 2024, 15002 : 302 - 312
[40] Representation and Fusion Based on Knowledge Graph in Multi-Modal Semantic Communication
Xing, Chenlin
Lv, Jie
Luo, Tao
Zhang, Zhilong
IEEE WIRELESS COMMUNICATIONS LETTERS, 2024, 13 (05) : 1344 - 1348
