Learning Object Context for Dense Captioning

Citations: 0
Authors
Li, Xiangyang [1 ,2 ]
Jiang, Shuqiang [1 ,2 ]
Han, Jungong [3 ]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, Key Lab Intelligent Informat Proc, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
[3] Univ Lancaster, Sch Comp & Commun, Lancaster, England
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Dense captioning is a challenging task that requires not only detecting visual elements in images but also generating natural language sentences to describe them. Previous approaches do not leverage object information in images for this task. However, objects provide valuable cues for predicting the locations of caption regions, since caption regions often overlap heavily with objects (i.e., caption regions are usually parts of objects or combinations of them). Meanwhile, objects also provide important information for describing a target caption region, as the corresponding description not only depicts its properties but also involves its interactions with objects in the image. In this work, we propose a novel scheme with an object context encoding Long Short-Term Memory (LSTM) network to automatically learn complementary object context for each caption region, transferring knowledge from objects to caption regions. All contextual objects are arranged as a sequence and progressively fed into the context encoding module to obtain context features. Both the learned object context features and the region features are then used to predict the bounding box offsets and generate the descriptions. The context learning procedure is carried out jointly with the optimization of both location prediction and caption generation, enabling the object context encoding LSTM to capture and aggregate useful object context. Experiments on benchmark datasets demonstrate the superiority of our proposed approach over state-of-the-art methods.
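The abstract describes the architecture only at a high level; the sketch below gives one plausible reading of it in Python/PyTorch: an LSTM encodes the sequence of contextual object features into a single context vector, which is fused with the caption-region feature to regress bounding-box offsets and to condition a caption decoder, so that both losses are optimized jointly. All class names, feature dimensions, and the concatenation-based fusion are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class ObjectContextDenseCaptioner(nn.Module):
    # Hypothetical sketch: object-context LSTM + fused region feature for
    # box-offset regression and caption generation (dimensions are assumptions).
    def __init__(self, feat_dim=512, hidden_dim=512, embed_dim=300, vocab_size=10000):
        super().__init__()
        # Object context encoding LSTM: consumes contextual object features as a sequence.
        self.context_lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Bounding-box offset head on the fused (region + context) feature.
        self.bbox_head = nn.Linear(feat_dim + hidden_dim, 4)
        # Caption decoder conditioned on the fused feature at every step.
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTMCell(embed_dim + feat_dim + hidden_dim, hidden_dim)
        self.word_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, region_feat, object_feats, captions):
        # region_feat:  (B, feat_dim)      feature of the target caption region
        # object_feats: (B, N, feat_dim)   features of N contextual objects, ordered as a sequence
        # captions:     (B, T)             ground-truth word indices (teacher forcing)
        _, (h_n, _) = self.context_lstm(object_feats)     # progressively aggregate object context
        context = h_n[-1]                                  # (B, hidden_dim) learned object context
        fused = torch.cat([region_feat, context], dim=1)   # (B, feat_dim + hidden_dim)

        bbox_offsets = self.bbox_head(fused)               # (B, 4) box refinement

        B, T = captions.shape
        h = fused.new_zeros(B, self.decoder.hidden_size)
        c = fused.new_zeros(B, self.decoder.hidden_size)
        logits = []
        for t in range(T):
            step_in = torch.cat([self.word_embed(captions[:, t]), fused], dim=1)
            h, c = self.decoder(step_in, (h, c))
            logits.append(self.word_head(h))
        return bbox_offsets, torch.stack(logits, dim=1)    # (B, 4), (B, T, vocab_size)

Because the box-regression loss and the caption cross-entropy loss would both backpropagate through the fused feature, the context LSTM is pushed to aggregate objects useful for both tasks, which matches the joint-optimization point made in the abstract.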
Pages: 8650-8657
Page count: 8
Related Papers
50 in total (items [31]-[40] shown)
  • [31] Dense Video Captioning for Incomplete Videos
    Dang, Xuan
    Wang, Guolong
    Xiong, Kun
    Qin, Zheng
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2021, PT V, 2021, 12895 : 665 - 676
  • [32] Adversarial Reinforcement Learning With Object-Scene Relational Graph for Video Captioning
    Hua, Xia
    Wang, Xinqing
    Rui, Ting
    Shao, Faming
    Wang, Dong
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31 : 2004 - 2016
  • [33] ECOL-R: Encouraging Copying in Novel Object Captioning with Reinforcement Learning
    Wang, Yufei
    Wood, Ian D.
    Wan, Stephen
    Johnson, Mark
    16TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2021), 2021, : 1222 - 1234
  • [34] Scan2Cap: Context-aware Dense Captioning in RGB-D Scans
    Chen, Dave Zhenyu
    Gholami, Ali
Niessner, Matthias
    Chang, Angel X.
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 3192 - 3202
  • [36] Object Modifier Generation for Image Captioning
    Liao, Lidou
    Song, Yonghong
    Zhang, Yuanlin
    2020 CHINESE AUTOMATION CONGRESS (CAC 2020), 2020, : 52 - 57
  • [37] Natural Language Navigation for Robotic Systems: Integrating GPT and Dense Captioning Models with Object Detection in Autonomous Inspections
    Choudhury, Nilay R.
    Wen, Yining
    Chen, Kaiwen
    CONSTRUCTION RESEARCH CONGRESS 2024: ADVANCED TECHNOLOGIES, AUTOMATION, AND COMPUTER APPLICATIONS IN CONSTRUCTION, 2024, : 972 - 980
  • [38] Object semantic analysis for image captioning
    Du, Sen
    Zhu, Hong
    Lin, Guangfeng
    Wang, Dong
    Shi, Jing
    Wang, Jing
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (28) : 43179 - 43206
  • [39] Image Captioning with Object Detection and Localization
    Yang, Zhongliang
    Zhang, Yu-Jin
    Rehman, Sadaqat Ur
    Huang, Yongfeng
    IMAGE AND GRAPHICS (ICIG 2017), PT II, 2017, 10667 : 109 - 118
  • [40] Object Captioning and Retrieval with Natural Language
    Anh Nguyen
    Tran, Quang D.
    Thanh-Toan Do
    Reid, Ian
    Caldwell, Darwin G.
    Tsagarakis, Nikos G.
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW), 2019, : 2584 - 2592