Efficient Medical Images Text Detection with Vision-Language Pre-training Approach

Cited by: 0

Authors:

Li, Tianyang [1,2]

Bai, Jinxu [1]

Wang, Qingzhu [1]

Xu, Hanwen [1]

Affiliations:

[1] Northeast Elect Power Univ, Comp Sci, Jilin, Peoples R China

[2] Jiangxi New Energy Technol Inst, Nanchang, Jiangxi, Peoples R China

Source:

ASIAN CONFERENCE ON MACHINE LEARNING, VOL 222 | 2023 / Vol. 222

Keywords:

vision-language pre-training; medical text detection; feature enhancement; differentiable binarization

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Text detection in medical images is a critical task, essential for automating the extraction of valuable information from diverse healthcare documents. Conventional text detection methods, predominantly based on segmentation, encounter substantial challenges when confronted with text-rich images, extreme aspect ratios, and multi-oriented text. In response to these complexities, this paper introduces an innovative text detection system aimed at enhancing its efficacy. Our proposed system comprises two fundamental components: the Efficient Feature Enhancement Module (EFEM) and the Multi-Scale Feature Fusion Module (MSFM), both serving as integral elements of the segmentation head. The EFEM incorporates a spatial attention mechanism to improve segmentation performance by introducing multi-level information. The MSFM merges features from the EFEM at different depths and scales to generate final segmentation features. In conjunction with our segmentation methodology, our post-processing module employs a differentiable binarization technique, facilitating adaptive threshold adjustment to enhance text detection precision. To further bolster accuracy and robustness, we introduce the integration of a vision-language pre-training model. Through extensive pretraining on large-scale visual language understanding tasks, this model amasses a wealth of rich visual and semantic representations. When seamlessly integrated with the segmentation module, the pretraining model effectively leverages its potent representation capabilities. Our proposed model undergoes rigorous evaluation on medical text image datasets, consistently demonstrating exceptional performance. Benchmark experiments reaffirm its efficacy.
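The abstract names differentiable binarization but does not give its formula. As a minimal sketch, assuming the widely used formulation B = 1 / (1 + exp(-k (P - T))), where P is the predicted probability map, T is the learned threshold map, and k is an amplifying factor (commonly 50), the step could look like the following. The function name, the sample maps, and the value of k are illustrative assumptions, not details taken from this paper.

```python
import math


def differentiable_binarization(prob_map, thresh_map, k=50.0):
    """Approximate a hard threshold with a steep sigmoid.

    prob_map and thresh_map are equally sized 2D lists with values in [0, 1].
    k is an assumed amplifying factor; larger k gives a sharper step.
    """
    return [
        [1.0 / (1.0 + math.exp(-k * (p - t))) for p, t in zip(p_row, t_row)]
        for p_row, t_row in zip(prob_map, thresh_map)
    ]


# Hypothetical 2x2 probability and threshold maps, for illustration only.
prob = [[0.9, 0.2], [0.5, 0.55]]
thresh = [[0.3, 0.3], [0.5, 0.5]]
approx = differentiable_binarization(prob, thresh)
```

Because the sigmoid is smooth, gradients can flow to both the probability map and the threshold map during training, which is what allows the threshold to adapt per pixel instead of being fixed by hand.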


Pages: 16

