Efficient Medical Images Text Detection with Vision-Language Pre-training Approach

Cited by: 0

Authors:

Li, Tianyang [1,2]

Bai, Jinxu [1]

Wang, Qingzhu [1]

Xu, Hanwen [1]

Affiliations:

[1] Northeast Electric Power University, Computer Science, Jilin, People's Republic of China

[2] Jiangxi New Energy Technology Institute, Nanchang, Jiangxi, People's Republic of China

Source:

ASIAN CONFERENCE ON MACHINE LEARNING, VOL 222 | 2023 / Vol. 222

Keywords:

vision-language pre-training; medical text detection; feature enhancement; differentiable binarization

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Text detection in medical images is a critical task, essential for automating the extraction of valuable information from diverse healthcare documents. Conventional text detection methods, predominantly based on segmentation, encounter substantial challenges when confronted with text-rich images, extreme aspect ratios, and multi-oriented text. In response to these complexities, this paper introduces an innovative text detection system aimed at enhancing its efficacy. Our proposed system comprises two fundamental components: the Efficient Feature Enhancement Module (EFEM) and the Multi-Scale Feature Fusion Module (MSFM), both serving as integral elements of the segmentation head. The EFEM incorporates a spatial attention mechanism to improve segmentation performance by introducing multi-level information. The MSFM merges features from the EFEM at different depths and scales to generate final segmentation features. In conjunction with our segmentation methodology, our post-processing module employs a differentiable binarization technique, facilitating adaptive threshold adjustment to enhance text detection precision. To further bolster accuracy and robustness, we introduce the integration of a vision-language pre-training model. Through extensive pretraining on large-scale visual language understanding tasks, this model amasses a wealth of rich visual and semantic representations. When seamlessly integrated with the segmentation module, the pretraining model effectively leverages its potent representation capabilities. Our proposed model undergoes rigorous evaluation on medical text image datasets, consistently demonstrating exceptional performance. Benchmark experiments reaffirm its efficacy.
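
The abstract names differentiable binarization as the post-processing step but gives no formula. As a rough illustration only, the sketch below uses the commonly cited form B = 1 / (1 + exp(-k (P - T))), where P is a predicted probability map and T a learned per-pixel threshold map. The function name, the toy maps, and the amplification factor k = 50 are assumptions for illustration and are not taken from this paper.

```python
import math


def differentiable_binarization(prob_map, thresh_map, k=50.0):
    """Soft binarization: B = 1 / (1 + exp(-k * (P - T))).

    prob_map and thresh_map are nested lists of equal shape with values
    in [0, 1]. k is an assumed amplification factor; larger k makes the
    output closer to a hard step while keeping it differentiable.
    """
    return [
        [1.0 / (1.0 + math.exp(-k * (p - t))) for p, t in zip(p_row, t_row)]
        for p_row, t_row in zip(prob_map, thresh_map)
    ]


# Hypothetical 2x2 maps: the threshold varies per pixel, so the same
# probability can be kept in one place and suppressed in another.
prob_map = [[0.9, 0.2], [0.55, 0.5]]
thresh_map = [[0.3, 0.3], [0.5, 0.6]]
binary_map = differentiable_binarization(prob_map, thresh_map)
```

Because the step function is replaced by a steep sigmoid, gradients can flow to both the probability and threshold maps during training, which is what allows the threshold to adapt per pixel.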


Pages: 16
