Learning Object Context for Dense Captioning

Cited by: 0
Authors
Li, Xiangyang [1 ,2 ]
Jiang, Shuqiang [1 ,2 ]
Han, Jungong [3 ]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, Key Lab Intelligent Informat Proc, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
[3] Univ Lancaster, Sch Comp & Commun, Lancaster, England
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Dense captioning is a challenging task which not only detects visual elements in images but also generates natural language sentences to describe them. Previous approaches do not leverage object information in images for this task. However, objects provide valuable cues to help predict the locations of caption regions, as caption regions often highly overlap with objects (i.e., caption regions are usually parts of objects or combinations of them). Meanwhile, objects also provide important information for describing a target caption region, as the corresponding description not only depicts its properties but also involves its interactions with objects in the image. In this work, we propose a novel scheme with an object context encoding Long Short-Term Memory (LSTM) network to automatically learn complementary object context for each caption region, transferring knowledge from objects to caption regions. All contextual objects are arranged as a sequence and progressively fed into the context encoding module to obtain context features. Both the learned object context features and the region features are then used to predict the bounding box offsets and generate the descriptions. The context learning procedure is carried out in conjunction with the optimization of both location prediction and caption generation, enabling the object context encoding LSTM to capture and aggregate useful object context. Experiments on benchmark datasets demonstrate the superiority of our proposed approach over state-of-the-art methods.
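The abstract describes sequentially feeding contextual object features through an LSTM to obtain an aggregated context vector, which is then combined with the region feature for box regression and caption generation. A minimal NumPy sketch of that aggregation step, assuming toy dimensions and random weights — all class and function names here are illustrative, not the authors' implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ObjectContextLSTM:
    """Toy LSTM cell that folds a sequence of object features into
    one context vector (hypothetical dims, random init for illustration)."""

    def __init__(self, feat_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        # stacked weights for the input, forget, output, and cell gates
        self.W = rng.standard_normal((4 * hidden_dim, feat_dim + hidden_dim)) * 0.1
        self.b = np.zeros(4 * hidden_dim)
        self.hidden_dim = hidden_dim

    def encode(self, object_feats):
        h = np.zeros(self.hidden_dim)
        c = np.zeros(self.hidden_dim)
        for x in object_feats:  # contextual objects fed in as a sequence
            z = self.W @ np.concatenate([x, h]) + self.b
            i, f, o, g = np.split(z, 4)
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
            h = sigmoid(o) * np.tanh(c)
        return h  # aggregated object context feature

def region_representation(region_feat, context_feat):
    # Combine the region feature with the learned object context;
    # downstream heads would predict box offsets and caption words.
    return np.concatenate([region_feat, context_feat])

enc = ObjectContextLSTM(feat_dim=8, hidden_dim=4)
objects = [np.ones(8), np.zeros(8), np.full(8, 0.5)]
context = enc.encode(objects)                      # shape (4,)
rep = region_representation(np.ones(8), context)   # shape (12,)
```

In the paper this encoding is trained jointly with the localization and captioning losses; the sketch only shows the forward aggregation, with simple concatenation standing in for whatever fusion the full model uses.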
Pages: 8650-8657
Number of pages: 8
Related Papers
50 records in total
  • [21] Dense Captioning of Natural Scenes in Spanish
    Gomez-Garay, Alejandro
    Raducanu, Bogdan
    Salas, Joaquin
    PATTERN RECOGNITION, 2018, 10880 : 145 - 154
  • [22] Weakly Supervised Dense Video Captioning
    Shen, Zhiqiang
    Li, Jianguo
    Su, Zhou
    Li, Minjun
    Chen, Yurong
    Jiang, Yu-Gang
    Xue, Xiangyang
    30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 5159 - 5167
  • [23] Context constraint in object learning and recognition
    Liu, Z.
    PERCEPTION, 1996, 25 : 92 - 92
  • [24] Pyramid context learning for object detection
    Ding, Pengxin
    Zhang, Jianping
    Zhou, Huan
    Zou, Xiang
    Wang, Minghui
    JOURNAL OF SUPERCOMPUTING, 2020, 76 (12): : 9374 - 9387
  • [25] Context Learning Network for Object Detection
    Leng, Jiaxu
    Liu, Ying
    Zhang, Tianlin
    Quan, Pei
    2018 18TH IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS (ICDMW), 2018, : 667 - 673
  • [26] Multimodal Pretraining for Dense Video Captioning
    Huang, Gabriel
    Pang, Bo
    Zhu, Zhenhai
    Rivera, Clara
    Soricut, Radu
    1ST CONFERENCE OF THE ASIA-PACIFIC CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 10TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (AACL-IJCNLP 2020), 2020, : 470 - 490
  • [27] An Efficient Framework for Dense Video Captioning
    Suin, Maitreya
    Rajagopalan, A. N.
    THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 12039 - 12046
  • [28] Image Graph Production by Dense Captioning
    Sahba, Amin
    Das, Arun
    Rad, Paul
    Jamshidi, Mo
    2018 WORLD AUTOMATION CONGRESS (WAC), 2018, : 193 - 198
  • [30] Dense-Captioning Events in Videos
    Krishna, Ranjay
    Hata, Kenji
    Ren, Frederic
    Fei-Fei, Li
    Niebles, Juan Carlos
    2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 706 - 715