Multi-view pedestrian captioning with an attention topic CNN model

Cited by: 7
Authors
Liu, Quan [1 ,3 ,4 ]
Chen, Yingying [1 ,2 ]
Wang, Jinqiao [1 ,2 ]
Zhang, Sijiong [1 ,3 ,4 ]
Affiliations
[1] Univ Chinese Acad Sci, 95 Zhongguancun East Rd, Beijing 100190, Peoples R China
[2] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing 100190, Peoples R China
[3] Chinese Acad Sci, Nanjing Inst Astron Opt & Technol, Natl Astron Observ, Nanjing 210042, Jiangsu, Peoples R China
[4] Chinese Acad Sci, Nanjing Inst Astron Opt & Technol, Key Lab Astron Opt & Technol, Nanjing 210042, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Image captioning; Pedestrian description; Multi-view captions;
DOI
10.1016/j.compind.2018.01.015
CLC Classification
TP39 [Computer Applications];
Subject Classification
081203; 0835;
Abstract
Image captioning is a fundamental task connecting computer vision and natural language processing. Recent research usually concentrates on generic image or video captioning across thousands of classes. However, such methods fail to cover detailed semantics and cannot effectively handle a specific class of objects, such as pedestrians. Pedestrian captioning plays a critical role in the analysis, identification, and retrieval of massive collections of video data. Therefore, in this paper, we propose a novel approach to generate multi-view captions for pedestrian images with a topic attention mechanism over global and local semantic regions. First, we detect different local parts of a pedestrian and use a deep convolutional neural network (CNN) to extract a series of features from these local regions and the whole image. Then, we aggregate these features with a topic attention CNN model to produce a representative vector that richly expresses the image from a different view at each time step. This feature vector is fed into a hierarchical recurrent neural network to generate multi-view captions for pedestrian images. Finally, a new dataset named CASIA_Pedestrian, comprising 5000 pedestrian image-sentence pairs, is collected to evaluate the performance of pedestrian captioning. Experiments and comparison results show the superiority of our proposed approach. (C) 2018 Elsevier B.V. All rights reserved.
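The aggregation step described in the abstract (region features weighted by a topic-conditioned attention into a single representative vector per time step) can be sketched as follows. The function name, tensor shapes, and the bilinear scoring form are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def topic_attention(region_feats, topic_vec, W):
    """Aggregate region features into one representative vector.

    region_feats: (N, D) CNN features from local parts plus the whole image
    topic_vec:    (T,)   topic embedding for the current view/time step
    W:            (D, T) learned projection into topic space (assumed form)
    """
    scores = region_feats @ W @ topic_vec   # (N,) topic-conditioned scores
    alpha = softmax(scores)                 # (N,) attention weights, sum to 1
    return alpha @ region_feats             # (D,) attention-weighted sum
```

At each decoding time step a different topic vector would shift the attention weights, so the hierarchical RNN receives a feature vector emphasizing a different view of the pedestrian.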
Pages: 47 - 53
Number of pages: 7
Related Papers
50 records
  • [41] Multi-view Stereo Network with Attention Thin Volume
    Wan, Zihang
    Xu, Chao
    Hu, Jing
    Xiao, Jian
    Meng, Zhaopeng
    Chen, Jitai
    PRICAI 2022: TRENDS IN ARTIFICIAL INTELLIGENCE, PT III, 2022, 13631 : 410 - 423
  • [42] MANNER: MULTI-VIEW ATTENTION NETWORK FOR NOISE ERASURE
    Park, Hyun Joon
    Kang, Byung Ha
    Shin, Wooseok
    Kim, Jin Sob
    Han, Sung Won
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7842 - 7846
  • [43] Multi-source Neural Topic Modeling in Multi-view Embedding Spaces
    Gupta, Pankaj
    Chaudhary, Yatin
    Schuetze, Hinrich
    2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL-HLT 2021), 2021, : 4205 - 4217
  • [44] Multi-View Guided Multi-View Stereo
    Poggi, Matteo
    Conti, Andrea
    Mattoccia, Stefano
    2022 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2022, : 8391 - 8398
  • [45] Multi-view graph convolutional networks with attention mechanism
    Yao, Kaixuan
    Liang, Jiye
    Liang, Jianqing
    Li, Ming
    Cao, Feilong
    ARTIFICIAL INTELLIGENCE, 2022, 307
  • [46] Multi-view Attention Networks for Visual Question Answering
    Li, Min
    Bai, Zongwen
    Deng, Jie
    2024 6TH INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING, ICNLP 2024, 2024, : 788 - 794
  • [47] Multi-view Graph Attention Network for Travel Recommendation
    Chen, Lei
    Cao, Jie
    Wang, Youquan
    Liang, Weichao
    Zhu, Guixiang
    EXPERT SYSTEMS WITH APPLICATIONS, 2022, 191
  • [48] Multi-view Mixed Attention for Contrastive Learning on Hypergraphs
    Lee, Jongsoo
    Chae, Dong-Kyu
    PROCEEDINGS OF THE 47TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2024, 2024, : 2543 - 2547
  • [49] Action Recognition with a Multi-View Temporal Attention Network
    Sun, Dengdi
    Su, Zhixiang
    Ding, Zhuanlian
    Luo, Bin
    Cognitive Computation, 2022, 14 : 1082 - 1095
  • [50] Monocular depth estimation with multi-view attention autoencoder
    Jung, Geunho
    Yoon, Sang Min
    MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 81 (23) : 33759 - 33770