Cross-modal Representation Learning for Understanding Manufacturing Procedure

Cited by: 0
Authors:
Hashimoto, Atsushi [1 ]
Nishimura, Taichi [2 ]
Ushiku, Yoshitaka [1 ]
Kameko, Hirotaka [2 ]
Mori, Shinsuke [2 ]
Affiliations:
[1] OMRON SINIC X Corp, Tokyo, Japan
[2] Kyoto Univ, Kyoto, Japan
Keywords:
Procedural text generation; Image captioning; Video captioning; Understanding manufacturing activity;
DOI:
10.1007/978-3-031-06047-2_4
Chinese Library Classification: TP [Automation technology, computer technology]
Discipline code: 0812
Abstract:
Assembly, biochemical experiments, and cooking are representative tasks that create new value from multiple materials through multiple processes. If a machine can computationally understand such manufacturing tasks, we will have various options for human-machine collaboration on them, from video scene retrieval to robots that act on behalf of humans. As one form of such understanding, this paper introduces a series of our studies that aim to associate visual observations of the processes with the procedural texts that instruct those processes. In these studies, captioning is the key task, where the input is an image sequence or video clips, and our methods remain state of the art. Through the explanation of these techniques, we give an overview of machine learning technologies that deal with the contextual information of manufacturing tasks.
Pages: 44-57
Page count: 14