Application and Prospect of Deep Learning in Video Object Segmentation

Authors
Chen J. [1]
Chen Y.-S. [1]
Li W.-H. [2]
Tian Y. [1]
Liu Z. [3]
He Y. [4]
Affiliations
[1] Department of Education and Information Technology, Central China Normal University, Wuhan
[2] Visual Learning Lab, Heidelberg University, Heidelberg
[3] National Engineering Laboratory for Educational Big Data, Central China Normal University, Wuhan
[4] Graduate School at Shenzhen, Tsinghua University, Shenzhen
Keywords
Deep learning; Interactive methods; Semi-supervised methods; Unsupervised methods; Video object segmentation
DOI
10.11897/SP.J.1016.2021.00609
Abstract
Video object segmentation is the task of locating and labeling every pixel that belongs to specific foreground objects in each frame of a given video sequence. It is one of the most important research topics in computer vision and plays an important role in applications such as 3D reconstruction, autonomous driving, and video editing. As computing power has improved, deep learning has attracted growing attention and made significant progress on this task. This paper first introduces the main task of video object segmentation and summarizes the main challenges it faces. Second, it gives a brief overview of the open datasets for the task, followed by the relevant benchmarks and common performance evaluation criteria. Third, it summarizes the state of research, introducing and analyzing the relevant methods in detail. These methods fall into three categories. The first category is semi-supervised methods: a detailed manual ground-truth annotation of the objects of interest is given for the first frame of the video sequence, and the objects are then segmented automatically in the remaining frames. For single-instance video object segmentation, taking the DAVIS16 dataset as an example, the Jaccard score of semi-supervised methods now exceeds 0.8. For multi-instance video object segmentation, for example on the widely used DAVIS18 dataset, the Jaccard score of semi-supervised methods has reached over 0.7. The second category is unsupervised methods, which identify and segment the foreground objects in a video using certain rules or models, without any manually labeled prior information.
The third category is interactive methods, which rely on rough manual prior information supplied interactively. In these methods, rough prior information such as points, bounding boxes, and scribbles is obtained from interactive modules, and segmentation proceeds over multiple rounds of manual participation, each requiring only a small amount of work. The setting of this third category can be regarded as a compromise between the first two. Compared with semi-supervised methods, it still requires manual participation but only a small amount of labeling work. Compared with unsupervised methods, it adds some manual labeling information to some frames of the video sequence, which makes the methods better targeted at the objects of interest. The best Jaccard scores of the unsupervised and interactive methods both reach 0.8 on the DAVIS16 dataset. However, few unsupervised methods address the multi-instance problem of the DAVIS18 dataset, and the best interactive methods reach a Jaccard score of only 0.64 on the DAVIS18 interactive dataset. Finally, the paper summarizes the applications of deep learning in video object segmentation and proposes some promising ideas from four different aspects. © 2021, Science Press. All rights reserved.
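The Jaccard score quoted in the abstract is the intersection-over-union of a predicted object mask and its ground-truth mask, usually averaged over the frames of a sequence. The following is a minimal sketch of that computation, not the official benchmark evaluation code: the function names `jaccard` and `mean_jaccard`, the toy masks, and the convention that two empty masks score 1.0 are all assumptions made for illustration.

```python
from typing import Sequence

# A binary mask as nested sequences: 1 = object pixel, 0 = background.
Mask = Sequence[Sequence[int]]


def jaccard(pred: Mask, truth: Mask) -> float:
    """Intersection-over-union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_jaccard(preds: Sequence[Mask], truths: Sequence[Mask]) -> float:
    """Average the per-frame Jaccard score over a video sequence."""
    if len(preds) == 0 or len(preds) != len(truths):
        raise ValueError("need the same non-zero number of frames")
    scores = [jaccard(p, t) for p, t in zip(preds, truths)]
    return sum(scores) / len(scores)


# Hypothetical 3x4 masks for a single frame.
pred = [[0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
truth = [[0, 1, 1, 1],
         [0, 1, 1, 1],
         [0, 0, 0, 0]]

print(jaccard(pred, truth))  # 4 shared pixels over 6 in the union
```

Under this definition, a score above 0.8 means that, on average, the overlap between predicted and annotated object pixels covers more than 80% of their union.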
Pages: 609-631
Number of pages: 22