Multimodal learning with only image data: A deep unsupervised model for street view image retrieval by fusing visual and scene text features of images

Authors
Wu, Shangyou [1]
Yu, Wenhao [1,2]
Zhang, Yifan [1]
Huang, Mengqiu [1]
Affiliations
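[1] China University of Geosciences, School of Geography and Information Engineering, Wuhan, People's Republic of China
[2] China University of Geosciences, National Engineering Research Center for Geographic Information System, Wuhan, People's Republic of China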
Funding
National Natural Science Foundation of China
Keywords
RECOGNITION;
DOI
10.1111/tgis.13146
Chinese Library Classification
P9 [Physical Geography]; K9 [Geography]
Discipline classification codes
0705; 070501
Abstract
As one of the classic tasks in information retrieval, image retrieval aims to identify images that share similar features with a query image, enabling users to conveniently find the required information in a large collection of images. Street view image retrieval, in particular, finds extensive applications in many fields, such as improving navigation and mapping services, formulating urban development plans, and analyzing the historical evolution of buildings. However, the intricate foreground and background details in street view images, coupled with a lack of attribute annotations, make it one of the most challenging problems in practical applications. Current image retrieval research mainly uses either visual models that depend entirely on the visual features of images or multimodal learning models that require additional data sources (e.g., annotated text). Yet creating annotated datasets is expensive, and street view images, which themselves contain a large amount of scene text, are often unannotated. Therefore, this paper proposes a deep unsupervised learning algorithm that combines visual and text features from image data to improve the accuracy of street view image retrieval. Specifically, we employ text detection algorithms to identify scene text, utilize the Pyramidal Histogram of Characters encoding predictor model to extract text information from images, deploy deep convolutional neural networks for visual feature extraction, and incorporate a contrastive learning module for image retrieval. Tests on three street view image datasets show that our model holds certain advantages over state-of-the-art multimodal models pre-trained on extensive datasets, while requiring fewer parameters and fewer floating point operations. Code and data are available at .
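To make the idea of fusing visual and scene text features concrete, the following is a minimal sketch and not the authors' implementation. It assumes a simplified binary Pyramidal Histogram of Characters (PHOC) computed directly from an already-recognized word (the paper instead predicts PHOC encodings from image regions), and it assumes fusion by concatenating L2-normalized vectors followed by cosine-similarity ranking. The function names `phoc`, `fuse`, and `retrieve`, the alphabet, and the pyramid levels are all hypothetical choices made for illustration; the contrastive learning module is not shown.

```python
import math
import string

# Assumed alphabet: lowercase letters and digits (36 symbols).
ALPHABET = string.ascii_lowercase + string.digits


def phoc(word, levels=(1, 2, 3)):
    """Simplified binary PHOC: one alphabet-sized block per pyramid region."""
    chars = [c for c in word.lower() if c in ALPHABET]
    n = len(chars)
    vec = []
    for level in levels:
        for region in range(level):
            bits = [0] * len(ALPHABET)
            # Integer arithmetic in units of 1 / (n * level) avoids float error.
            r_lo, r_hi = region * n, (region + 1) * n
            for i, ch in enumerate(chars):
                c_lo, c_hi = i * level, (i + 1) * level
                overlap = min(c_hi, r_hi) - max(c_lo, r_lo)
                # A character belongs to a region if at least half of its
                # span lies inside that region.
                if 2 * overlap >= level:
                    bits[ALPHABET.index(ch)] = 1
            vec.extend(bits)
    return vec


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse(visual, text_vec):
    """Assumed fusion: concatenate the two unit vectors, then renormalize."""
    return l2_normalize(l2_normalize(visual) + l2_normalize(text_vec))


def cosine(a, b):
    # Inputs are expected to be unit vectors, so the dot product suffices.
    return sum(x * y for x, y in zip(a, b))


def retrieve(query, gallery):
    """Return gallery indices ordered from most to least similar."""
    return sorted(range(len(gallery)),
                  key=lambda i: cosine(query, gallery[i]),
                  reverse=True)
```

In this sketch, two images with near-identical visual features but different shop signs are separated by the text half of the fused vector, which is the intuition behind using scene text as a second modality drawn from the image itself.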
Pages: 486-508
Number of pages: 23