Unsupervised Learning of Important Objects from First-Person Videos

Cited by: 26
Authors
Bertasius, Gedas [1 ]
Park, Hyun Soo [2 ]
Yu, Stella X. [3 ]
Shi, Jianbo [1 ]
Affiliations
[1] Univ Penn, Philadelphia, PA 19104 USA
[2] Univ Minnesota, Minneapolis, MN 55455 USA
[3] Univ Calif Berkeley, ICSI, Berkeley, CA USA
Source
2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV) | 2017
Keywords
DOI
10.1109/ICCV.2017.216
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
A first-person camera, worn on a person's head, captures which objects are important to the camera wearer. Most prior methods for this task learn to detect such important objects from manually labeled first-person data in a supervised fashion. However, important objects are strongly related to the camera wearer's internal state, such as intentions and attention, and thus only the person wearing the camera can provide the importance labels. Such a constraint makes the annotation process costly and limits its scalability. In this work, we show that we can detect important objects in first-person images without supervision by the camera wearer or even third-person labelers. We formulate the important object detection problem as an interplay between two agents: 1) a segmentation agent and 2) a recognition agent. The segmentation agent first proposes a possible important object segmentation mask for each image and then feeds it to the recognition agent, which learns to predict an important object mask using visual semantics and spatial features. We implement this interplay between the two agents via an alternating cross-pathway supervision scheme inside our proposed Visual-Spatial Network (VSN). Our VSN consists of spatial ("where") and visual ("what") pathways, one of which learns common visual semantics while the other focuses on spatial location cues. Our unsupervised learning is accomplished via cross-pathway supervision, where one pathway feeds its predictions to the segmentation agent, which proposes a candidate important object segmentation mask that is then used by the other pathway as a supervisory signal. We show our method's success on two different important object datasets, where it achieves results similar to or better than those of supervised methods.
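The abstract's central mechanism is the alternating cross-pathway supervision between the "what" and "where" pathways. The following is a minimal sketch of that alternation, assuming each pathway is a network that maps an image batch to a per-pixel importance map in [0, 1]; the names VSN, propose_segmentation_mask, and train_step, and the simple thresholding used for the segmentation agent, are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn


class VSN(nn.Module):
    """Two-pathway network: a visual ("what") and a spatial ("where") pathway.

    Both pathways are assumed to output per-pixel importance probabilities
    of the same spatial size as their input (hypothetical interface).
    """

    def __init__(self, visual_pathway: nn.Module, spatial_pathway: nn.Module):
        super().__init__()
        self.visual = visual_pathway    # learns common visual semantics
        self.spatial = spatial_pathway  # focuses on spatial location cues


def propose_segmentation_mask(prediction: torch.Tensor) -> torch.Tensor:
    """Segmentation agent (sketch): convert one pathway's soft prediction into a
    candidate binary important-object mask, here by simple thresholding."""
    return (prediction > 0.5).float()


def train_step(vsn: VSN, images: torch.Tensor, optimizer: torch.optim.Optimizer,
               visual_supervises_spatial: bool = True) -> float:
    """One alternating cross-pathway supervision step: one pathway's prediction
    is turned into a candidate mask that supervises the other pathway."""
    teacher, student = ((vsn.visual, vsn.spatial) if visual_supervises_spatial
                        else (vsn.spatial, vsn.visual))
    with torch.no_grad():
        pseudo_mask = propose_segmentation_mask(teacher(images))
    loss = nn.functional.binary_cross_entropy(student(images), pseudo_mask)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch, training would alternate the `visual_supervises_spatial` flag across iterations so that each pathway in turn provides the supervisory signal for the other, which is how the abstract describes the unsupervised learning being driven without any human importance labels.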
Pages: 1974 - 1982
Page count: 9
Related Papers
50 items in total
  • [41] Supervised saliency maps for first-person videos based on sparse coding
    Li, Yujie
    Kanemura, Atsunori
    Asoh, Hideki
    Miyanishi, Taiki
    Kawanabe, Motoaki
    2018 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2018, : 2000 - 2005
  • [42] EgoScanning: Quickly Scanning First-Person Videos with Egocentric Elastic Timelines
    Higuchi, Keita
    Yonetani, Ryo
    Sato, Yoichi
    PROCEEDINGS OF THE 2017 ACM SIGCHI CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS (CHI'17), 2017, : 6536 - 6546
  • [43] Discovering Objects of Joint Attention via First-Person Sensing
    Kera, Hiroshi
    Yonetani, Ryo
    Higuchi, Keita
    Sato, Yoichi
    PROCEEDINGS OF 29TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, (CVPRW 2016), 2016, : 361 - 369
  • [44] Perception from the First-Person Perspective
    Howell, Robert J.
    EUROPEAN JOURNAL OF PHILOSOPHY, 2016, 24 (01) : 187 - 213
  • [45] Robot-Centric Activity Recognition from First-Person RGB-D Videos
    Xia, Lu
    Gori, Ilaria
    Aggarwal, J. K.
    Ryoo, M. S.
    2015 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2015, : 357 - 364
  • [46] Robot-Centric Activity Prediction from First-Person Videos: What Will They Do to Me?
    Ryoo, M. S.
    Fuchs, Thomas J.
    Xia, Lu
    Aggarwal, J. K.
    Matthies, Larry
    PROCEEDINGS OF THE 2015 ACM/IEEE INTERNATIONAL CONFERENCE ON HUMAN-ROBOT INTERACTION (HRI'15), 2015, : 295 - 302
  • [47] Multi-Sensor Integration for Key-Frame Extraction From First-Person Videos
    Li, Yujie
    Kanemura, Atsunori
    Asoh, Hideki
    Miyanishi, Taiki
    Kawanabe, Motoaki
    IEEE ACCESS, 2020, 8 (08): 122281 - 122291
  • [48] Supervised Saliency Mapping for First-Person Videos With an Inverse Sparse Coding Framework
    Li, Yujie
    Akaho, Shotaro
    Asoh, Hideki
    Tan, Benying
    IEEE ACCESS, 2019, 7 : 12547 - 12556
  • [49] Temporal Localization and Spatial Segmentation of Joint Attention in Multiple First-Person Videos
    Huang, Yifei
    Cai, Minjie
    Kera, Hiroshi
    Yonetani, Ryo
    Higuchi, Keita
    Sato, Yoichi
    2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2017), 2017, : 2313 - 2321
  • [50] First-Person Vision
    Kanade, Takeo
    Hebert, Martial
    PROCEEDINGS OF THE IEEE, 2012, 100 (08) : 2442 - 2453