Attentive Excitation and Aggregation for Bilingual Referring Image Segmentation

Cited by: 4

Authors:

Zhou, Qianli ^{[1]}

Hui, Tianrui ^{[2,5]}

Wang, Rong ^{[1]}

Hu, Haimiao ^{[3]}

Liu, Si ^{[4]}

Affiliations:

[1] Peoples Publ Secur Univ China, 1 Muxidi Nanli, Beijing, Peoples R China

[2] Chinese Acad Sci, Inst Informat Engn, 89 Minzhuang Rd, Beijing, Peoples R China

[3] Beihang Univ, 37 Xueyuan Rd, Beijing, Peoples R China

[4] Beihang Univ, Inst Artificial Intelligence, 37 Xueyuan Rd, Beijing, Peoples R China

[5] Univ Chinese Acad Sci, Sch Cyber Secur, 19 Yuquan Rd, Beijing, Peoples R China

Source:

ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY | 2021 / Volume 12 / Issue 02

Funding:

Beijing Natural Science Foundation; National Natural Science Foundation of China;

Keywords:

Bilingual referring segmentation; channel excitation; spatial aggregation;

DOI:

10.1145/3446345

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The goal of referring image segmentation is to identify the object matched with an input natural language expression. Previous methods support only English descriptions, which limits the potential application of this task, since Chinese is also widely used around the world. Therefore, we propose to extend existing datasets with Chinese descriptions and preprocessing tools for training and evaluating bilingual referring segmentation models. In addition, previous methods lack the ability to collaboratively learn channel-wise and spatial-wise cross-modal attention, which is needed to align the visual and linguistic modalities well. To tackle these limitations, we propose a Linguistic Excitation module to excite image channels guided by language information and a Linguistic Aggregation module to aggregate multimodal information based on image-language relationships. Since different levels of features from the visual backbone encode rich visual information, we also propose a Cross-Level Attentive Fusion module to fuse multilevel features gated by language information. Extensive experiments on four English and Chinese benchmarks show that our bilingual referring image segmentation model outperforms previous methods.
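The abstract names channel-wise excitation and spatial-wise aggregation without giving their formulation. As a rough illustration only, the sketch below shows one common way such ideas are realized: a sigmoid gate per image channel computed from a sentence vector, and per-pixel softmax attention over word vectors. It is not the authors' implementation; the function names, the weight matrices `w_gate` and `w_query`, the scaled dot-product scoring, and the concatenation-based fusion are all assumptions made for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def linguistic_excitation(visual, sentence, w_gate):
    """Scale each image channel by a gate computed from the sentence vector.

    visual:   list of C channels, each a flat list of H*W activations
    sentence: list of D floats (hypothetical sentence embedding)
    w_gate:   C x D weight matrix (hypothetical learned parameters)
    """
    gates = [sigmoid(dot(row, sentence)) for row in w_gate]
    return [[g * v for v in channel] for g, channel in zip(gates, visual)]


def linguistic_aggregation(visual, words, w_query):
    """For every pixel, attend over word vectors and append the attended context.

    words:   list of T word vectors, each of length D
    w_query: D x C matrix projecting a pixel's C channels into word space
    Returns (fused, attention). Both are pixel-major: fused[p] holds the
    C channel values of pixel p followed by its D-dimensional word context.
    """
    num_pixels = len(visual[0])
    dim = len(words[0])
    fused, attention = [], []
    for p in range(num_pixels):
        pixel = [channel[p] for channel in visual]
        query = [dot(row, pixel) for row in w_query]
        # Scaled dot-product score between this pixel and every word.
        weights = softmax([dot(query, w) / math.sqrt(dim) for w in words])
        context = [sum(a * w[d] for a, w in zip(weights, words)) for d in range(dim)]
        fused.append(pixel + context)
        attention.append(weights)
    return fused, attention
```

In this sketch the excitation step decides which channels matter for the expression, and the aggregation step decides which words matter at each spatial location. A real model would learn the weights end to end and operate on tensors from a visual backbone and a text encoder.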


Pages: 17
