Self-Supervision Interactive Alignment for Remote Sensing Image-Audio Retrieval

Cited by: 6

Authors:

Huang, Jinghao [1,2]

Chen, Yaxiong [1,2,3,4]

Xiong, Shengwu [1,2,3,4]

Lu, Xiaoqiang [5]

Affiliations:

[1] Wuhan Univ Technol, Sanya Sci & Educ Innovat Pk, Sanya 572000, Peoples R China

[2] Wuhan Univ Technol, Sch Comp Sci & Artificial Intelligence, Wuhan 430070, Peoples R China

[3] Shanghai Artificial Intelligence Lab, Shanghai 200232, Peoples R China

[4] Wuhan Univ Technol, Chongqing Res Inst, Chongqing 401122, Peoples R China

[5] Fuzhou Univ, Coll Phys & Informat Engn, Fuzhou 350108, Peoples R China

Source:

IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING | 2023 / Vol. 61

Keywords:

Remote sensing; Semantics; Visualization; Transformers; Task analysis; Unsupervised learning; Technological innovation; Cross-modal remote sensing (RS) retrieval; interactive alignment (IA); self-supervised learning; similarity preservation; SPARSE;

DOI:

10.1109/TGRS.2023.3264006

Chinese Library Classification (CLC) number:

P3 [Geophysics]; P59 [Geochemistry];

Discipline classification code:

0708 ; 070902 ;

Abstract:

Cross-modal remote sensing image-audio (RSIA) retrieval aims to use audio or remote sensing images (RSIs) as queries to retrieve relevant RSIs or the corresponding audio. Although many approaches leverage labeled samples to achieve good performance, labeled samples are costly to obtain, because labeling cross-modal remote sensing (RS) samples usually requires substantial human labor. Therefore, unsupervised cross-modal learning is very important in real-world applications. In this article, we propose a novel unsupervised cross-modal RSIA retrieval approach, named self-supervision interactive alignment (SSIA), which can take advantage of large amounts of unlabeled samples to learn the salient information, cross-modal alignment, and the similarity between RSIs and audio. Since self-supervised learning lacks the supervision of label information, we leverage the similarity between the input RSI information and audio information as the supervision signal. In addition, to perform cross-modal alignment, a novel interactive alignment (IA) module is designed to explore the fine-grained correspondence between RSIs and audio. Moreover, we design an audio-guided image de-redundant module that reduces redundant visual information and thereby captures the salient information of RSIs. Extensive experiments on four widely used RSIA datasets demonstrate that SSIA achieves better RSIA retrieval performance than the compared approaches.


Pages: 14

50 records in total

[1] Multimodal Fusion Remote Sensing Image-Audio Retrieval
Yang, Rui
Wang, Shuang
Sun, Yingzhi
Zhang, Huan
Liao, Yu
Gu, Yu
Hou, Biao
Jiao, Licheng
IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2022, 15 : 6220 - 6235
[2] Fine Aligned Discriminative Hashing for Remote Sensing Image-Audio Retrieval
Chen, Yaxiong
Huang, Jinghao
Xiong, Shengwu
Lu, Xiaoqiang
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2023, 61
[3] Cross-Modal Remote Sensing Image-Audio Retrieval With Adaptive Learning for Aligning Correlation
Huang, Jinghao
Chen, Yaxiong
Xiong, Shengwu
Lu, Xiaoqiang
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62
[4] Self-supervision assisted multimodal remote sensing image classification with coupled self-looping convolution networks
Pande, Shivam
Banerjee, Biplab
NEURAL NETWORKS, 2023, 164 : 1 - 20
[5] InsCLR: Improving Instance Retrieval with Self-Supervision
Deng, Zelu
Zhong, Yujie
Guo, Sheng
Huang, Weilin
THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 516 - 524
[6] Pre-Training Audio Representations With Self-Supervision
Tagliasacchi, Marco
Gfeller, Beat
Quitry, Felix de Chaumont
Roblek, Dominik
IEEE SIGNAL PROCESSING LETTERS, 2020, 27 : 600 - 604
[7] Semantic alignment with self-supervision for class incremental learning
Fu, Zhiling
Wang, Zhe
Xu, Xinlei
Yang, Mengping
Chi, Ziqiu
Ding, Weichao
KNOWLEDGE-BASED SYSTEMS, 2023, 282
[8] PVASS-MDD: Predictive Visual-Audio Alignment Self-Supervision for Multimodal Deepfake Detection
Yu, Yang
Liu, Xiaolong
Ni, Rongrong
Yang, Siyuan
Zhao, Yao
Kot, Alex C.
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (08) : 6926 - 6936
[9] Self-Supervision, Remote Sensing and Abstraction: Representation Learning Across 3 Million Locations
Seneviratne, Sachith
Nice, Kerry A.
Wijnands, Jasper S.
Stevenson, Mark
Thompson, Jason
2021 INTERNATIONAL CONFERENCE ON DIGITAL IMAGE COMPUTING: TECHNIQUES AND APPLICATIONS (DICTA 2021), 2021, : 189 - 196
[10] Interactive learning and probabilistic retrieval in remote sensing image archives
Schröder, M
Rehrauer, H
Seidel, K
Datcu, M
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2000, 38 (05) : 2288 - 2298
