Multi-view pedestrian captioning with an attention topic CNN model

Cited by: 7
Authors
Liu, Quan [1 ,3 ,4 ]
Chen, Yingying [1 ,2 ]
Wang, Jinqiao [1 ,2 ]
Zhang, Sijiong [1 ,3 ,4 ]
Affiliations
[1] Univ Chinese Acad Sci, 95 Zhongguancun East Rd, Beijing 100190, Peoples R China
[2] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing 100190, Peoples R China
[3] Chinese Acad Sci, Nanjing Inst Astron Opt & Technol, Natl Astron Observ, Nanjing 210042, Jiangsu, Peoples R China
[4] Chinese Acad Sci, Nanjing Inst Astron Opt & Technol, Key Lab Astron Opt & Technol, Nanjing 210042, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Image captioning; Pedestrian description; Multi-view captions;
DOI
10.1016/j.compind.2018.01.015
CLC Number
TP39 [Computer Applications];
Discipline Code
081203; 0835;
Abstract
Image captioning is a fundamental task connecting computer vision and natural language processing. Recent research usually concentrates on generic image or video captioning across thousands of classes. However, such methods fail to cover detailed semantics and cannot effectively handle a specific class of objects, such as pedestrians. Pedestrian captioning plays a critical role in the analysis, identification, and retrieval of massive collections of video data. Therefore, in this paper, we propose a novel approach to generate multi-view captions for pedestrian images with a topic attention mechanism over global and local semantic regions. First, we detect different local parts of the pedestrian and utilize a deep convolutional neural network (CNN) to extract a series of features from these local regions and the whole image. Then, we aggregate these features with a topic attention CNN model to produce a representative vector that richly expresses the image from a different view at each time step. This feature vector is taken as input to a hierarchical recurrent neural network to generate multi-view captions for pedestrian images. Finally, a new dataset named CASIA_Pedestrian, consisting of 5000 pedestrian image-sentence pairs, is collected to evaluate pedestrian captioning performance. Experiments and comparison results show the superiority of the proposed approach. (C) 2018 Elsevier B.V. All rights reserved.
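The paper itself provides no code. As a rough illustration of the aggregation step the abstract describes (attention weights over per-region CNN features, pooled into one representative vector per view), here is a minimal sketch; the function name, feature shapes, and the dot-product scoring are assumptions for illustration, not the authors' exact model:

```python
import numpy as np

def topic_attention_aggregate(region_feats, topic_query):
    """Pool per-region CNN features into one vector via softmax attention.

    region_feats: (num_regions, feat_dim) array of whole-image + part features.
    topic_query:  (feat_dim,) vector standing in for one topic/view at a time step.
    Both the dot-product scoring and the shapes are hypothetical choices.
    """
    scores = region_feats @ topic_query              # relevance of each region
    scores = scores - scores.max()                   # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax attention weights
    return weights @ region_feats                    # weighted sum, (feat_dim,)

# Toy example: 4 regions (whole image + 3 detected parts), 8-dim features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
query = rng.normal(size=8)
vec = topic_attention_aggregate(feats, query)
```

In the full model, a vector like `vec` would be produced per time step and fed to the hierarchical RNN that emits one caption per view.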
Pages: 47-53 (7 pages)