More Diverse Training, Better Compositionality! Evidence from Multimodal Language Learning

Cited by: 0
Authors
Volquardsen, Caspar [1 ]
Lee, Jae Hee [1 ]
Weber, Cornelius [1 ]
Wermter, Stefan [1 ]
Affiliations
[1] Univ Hamburg, Dept Informat, Knowledge Technol, Hamburg, Germany
Keywords
Compositional generalization; Computer vision; Multimodality; Sequence-to-sequence; Robotics
DOI
10.1007/978-3-031-15934-3_35
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Artificial neural networks still fall short of human-level generalization and require a very large number of training examples to succeed. Model architectures that further improve generalization capabilities are therefore still an open research question. We created a multimodal dataset from simulation for measuring the compositional generalization of neural networks in multimodal language learning. The dataset consists of sequences showing a robot arm interacting with objects on a table in a simple 3D environment, with the goal of describing the interaction. Compositional object features, multiple actions, and distracting objects pose challenges to the model. We show that an LSTM encoder-decoder architecture trained jointly with a vision encoder surpasses previous performance and handles multiple visible objects. Visualization of important input dimensions shows that a model trained with multiple objects, unlike a model trained on just one object, learns to ignore irrelevant objects. Furthermore, we show that additional modalities in the input improve overall performance. We conclude that the underlying training data has a significant influence on the model's capability to generalize compositionally.
Pages: 417-428
Page count: 12
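The abstract describes a vision encoder trained jointly with an LSTM encoder-decoder that maps a sequence of frames (plus optional further modalities) to a textual description of the interaction. A minimal PyTorch sketch of such an architecture follows; the CNN structure, layer sizes, vocabulary size, and frame shapes here are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch (assumed details): CNN vision encoder -> LSTM encoder
# over the frame sequence -> LSTM decoder emitting description tokens.
import torch
import torch.nn as nn

class VisionEncoder(nn.Module):
    """Maps each RGB frame to a feature vector (CNN details are assumptions)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, frames):                              # (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.conv(frames.flatten(0, 1)).flatten(1)  # (B*T, 64)
        return self.proj(feats).view(b, t, -1)              # (B, T, feat_dim)

class Seq2SeqDescriber(nn.Module):
    """Vision encoder and LSTM encoder-decoder trained jointly end to end.
    Additional input modalities (e.g., proprioception) could be concatenated
    to the per-frame features; that variant is omitted here for brevity."""
    def __init__(self, vocab_size=50, feat_dim=256, hidden=256):
        super().__init__()
        self.vision = VisionEncoder(feat_dim)
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frames, tokens):
        _, state = self.encoder(self.vision(frames))         # summarize the sequence
        dec_out, _ = self.decoder(self.embed(tokens), state) # condition on encoder state
        return self.out(dec_out)                             # (B, L, vocab_size) logits

# Usage: 4 frames of 64x64 video; teacher-forced next-token prediction.
model = Seq2SeqDescriber()
frames = torch.randn(2, 4, 3, 64, 64)
tokens = torch.randint(0, 50, (2, 6))
logits = model(frames, tokens[:, :-1])
loss = nn.functional.cross_entropy(logits.transpose(1, 2), tokens[:, 1:])
```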