OneVOS: Unifying Video Object Segmentation with All-in-One Transformer Framework

Cited by: 0
Authors
Li, Wanyun [1 ]
Guo, Pinxue [2 ]
Zhou, Xinyu [1 ]
Hong, Lingyi [1 ]
He, Yangji [1 ]
Zhang, Xiangyu [1 ]
Zhang, Wei [1 ]
Zhang, Wenqiang [1 ,3 ]
Affiliations
[1] Fudan Univ, Sch Comp Sci, Shanghai Key Lab Intelligent Informat Proc, Shanghai, Peoples R China
[2] Fudan Univ, Acad Engn & Technol, Shanghai Engn Res Ctr AI & Robot, Shanghai, Peoples R China
[3] Fudan Univ, Acad Engn & Technol, Engn Res Ctr AI & Robot, Minist Educ, Shanghai, Peoples R China
Source
Funding
National Natural Science Foundation of China;
Keywords
DOI
10.1007/978-3-031-73636-0_2
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Contemporary Video Object Segmentation (VOS) approaches typically consist of the stages of feature extraction, matching, memory management, and multi-object aggregation. Recent advanced models either model these components discretely in a sequential manner, or optimize a combined pipeline through substructure aggregation. However, such explicitly staged approaches prevent the VOS framework from being optimized as a unified whole, limiting its capacity and yielding suboptimal performance on complex videos. In this paper, we propose OneVOS, a novel framework that unifies the core components of VOS with an All-in-One Transformer. Specifically, to unify all the aforementioned modules into a vision transformer, we model all the features of frames, masks, and memory for multiple objects as transformer tokens, and integrally accomplish feature extraction, matching, and memory management of multiple objects through a flexible attention mechanism. Furthermore, a Unidirectional Hybrid Attention is proposed, via a double decoupling of the original attention operation, to rectify semantic errors and ambiguities of stored tokens in the OneVOS framework. Finally, to alleviate the storage burden and expedite inference, we propose the Dynamic Token Selector, which unveils the working mechanism of OneVOS and naturally leads to a more efficient version of OneVOS. Extensive experiments demonstrate the superiority of OneVOS, which achieves state-of-the-art performance across seven datasets, particularly excelling on the complex LVOS and MOSE datasets with J&F scores of 70.1% and 66.4%, surpassing previous state-of-the-art methods by 4.2% and 7.0%, respectively. Code is available at: https://github.com/L599wy/OneVOS.
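The core idea sketched in the abstract, treating memory and current-frame features as one joint token sequence and restricting the attention direction so that stored memory tokens are not corrupted by the frame being segmented, can be illustrated with a minimal NumPy sketch. This is a conceptual stand-in only, not the authors' implementation: the function name, shapes, and the specific masking rule are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def unidirectional_attention(mem_tokens, cur_tokens):
    """Joint self-attention over concatenated [memory; current-frame] tokens.

    Current-frame tokens attend to all tokens, while memory tokens attend
    only to other memory tokens -- a simplified analogue of a unidirectional
    attention scheme that keeps stored tokens untouched by the new frame.
    (Names and masking rule are illustrative, not the paper's exact design.)
    """
    tokens = np.concatenate([mem_tokens, cur_tokens], axis=0)  # (M+N, d)
    M = len(mem_tokens)
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)                    # (M+N, M+N)
    # Block the memory -> current-frame attention direction.
    scores[:M, M:] = -1e9
    out = softmax(scores, axis=-1) @ tokens
    return out[M:]  # updated current-frame tokens only
```

In a real model the tokens would come from patch embeddings of frames and masks, and the attention would be multi-head with learned projections; the sketch only shows how a single joint attention pass can replace separate matching and memory-readout stages.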
Pages: 20 / 40
Page count: 21
Related Papers
50 records
  • [21] Learning spatiotemporal relationships with a unified framework for video object segmentation
    Mei, Jianbiao
    Wang, Mengmeng
    Yang, Yu
    Li, Zizhang
    Liu, Yong
    APPLIED INTELLIGENCE, 2024, 54 (08) : 6138 - 6153
  • [22] A novel framework for semi-automatic video object segmentation
    Li, N
    Li, SP
    Liu, WY
    Chen, C
    2002 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS, VOL III, PROCEEDINGS, 2002, : 811 - 814
  • [23] A Unified Transformer Framework for Group-Based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection
    Su, Yukun
    Deng, Jingliang
    Sun, Ruizhou
    Lin, Guosheng
    Su, Hanjing
    Wu, Qingyao
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 313 - 325
  • [24] Exploring All-In-One Knowledge Distillation Framework for Neural Machine Translation
    Miao, Zhongjian
    Zhang, Wen
    Su, Jinsong
    Li, Xiang
    Luan, Jian
    Chen, Yidong
    Wang, Bin
    Zhang, Min
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 2929 - 2940
  • [25] AvatarGPT: All-in-One Framework for Motion Understanding, Planning, Generation and Beyond
    Zhou, Zixiang
    Wan, Yu
    Wang, Baoyuan
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2024, 2024, : 1357 - 1366
  • [26] SpikeMotion: A Transformer Framework for High-Throughput Video Segmentation on FPGA
    Udeji, Uchechukwu Leo
    Margala, Martin
    2024 IEEE 67TH INTERNATIONAL MIDWEST SYMPOSIUM ON CIRCUITS AND SYSTEMS, MWSCAS 2024, 2024, : 818 - 822
  • [27] NetIDE: All-in-one framework for next generation, composed SDN applications
    Aranda Gutierrez, P. A.
    Rojas, E.
    Schwabe, A.
    Stritzke, C.
    Doriguzzi-Corin, R.
    Leckey, A.
    Petralia, G.
    Marsico, A.
    Phemius, K.
    Tamurejo, S.
    2016 IEEE NETSOFT CONFERENCE AND WORKSHOPS (NETSOFT), 2016, : 355 - 356
  • [28] All-in-one "HairNet": A Deep Neural Model for Joint Hair Segmentation and Characterization
    Borza, Diana
    Yaghoubi, Ehsan
    Neves, Joao
    Proenca, Hugo
    IEEE/IAPR INTERNATIONAL JOINT CONFERENCE ON BIOMETRICS (IJCB 2020), 2020,
  • [29] Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation
    Yan, Shilin
    Zhang, Renrui
    Guo, Ziyu
    Chen, Wenchao
    Zhang, Wei
    Li, Hongyang
    Qiao, Yu
    Dong, Hao
    He, Zhongjiang
    Gao, Peng
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 6, 2024, : 6449 - 6457
  • [30] An Unified Recurrent Video Object Segmentation Framework for Various Surveillance Environments
    Patil, Prashant W.
    Dudhane, Akshay
    Kulkarni, Ashutosh
    Murala, Subrahmanyam
    Gonde, Anil Balaji
    Gupta, Sunil
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2021, 30 : 7889 - 7902