Efficient Medical Images Text Detection with Vision-Language Pre-training Approach

Cited by: 0
Authors
Li, Tianyang [1 ,2 ]
Bai, Jinxu [1 ]
Wang, Qingzhu [1 ]
Xu, Hanwen [1 ]
Affiliations
[1] Northeast Elect Power Univ, Comp Sci, Jilin, Peoples R China
[2] Jiangxi New Energy Technol Inst, Nanchang, Jiangxi, Peoples R China
Keywords
vision-language pre-training; medical text detection; feature enhancement; differentiable binarization;
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Text detection in medical images is a critical task, essential for automating the extraction of valuable information from diverse healthcare documents. Conventional text detection methods, predominantly segmentation-based, face substantial challenges with text-rich images, extreme aspect ratios, and multi-oriented text. In response to these complexities, this paper introduces a text detection system designed to improve detection efficacy. The proposed system comprises two core components, the Efficient Feature Enhancement Module (EFEM) and the Multi-Scale Feature Fusion Module (MSFM), both integral to the segmentation head. The EFEM incorporates a spatial attention mechanism that injects multi-level information to improve segmentation performance. The MSFM merges EFEM features from different depths and scales to produce the final segmentation features. Complementing the segmentation stage, the post-processing module employs a differentiable binarization technique, enabling adaptive threshold adjustment that sharpens text detection precision. To further improve accuracy and robustness, we integrate a vision-language pre-training model. Pre-trained at scale on visual-language understanding tasks, this model acquires rich visual and semantic representations; when coupled with the segmentation module, it contributes these representation capabilities directly to detection. The proposed model is evaluated rigorously on medical text image datasets and consistently demonstrates strong performance, which benchmark experiments confirm.
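The differentiable binarization step mentioned in the abstract can be illustrated with a minimal sketch. This assumes the standard formulation from the differentiable binarization literature, B = 1 / (1 + exp(-k·(P - T))), where P is the predicted probability map, T is a learned per-pixel threshold map, and k is an amplification factor; the function name and toy values below are illustrative, not taken from the paper.

```python
import numpy as np

def differentiable_binarization(prob_map, thresh_map, k=50.0):
    """Soft binarization B = 1 / (1 + exp(-k * (P - T))).

    Unlike a hard step function, this sigmoid surrogate is
    differentiable everywhere, so the adaptive threshold map T
    can be learned jointly with the probability map P during
    training; at a large k it closely approximates a hard cut.
    """
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))

# Toy example: per-pixel probabilities against a uniform threshold map.
P = np.array([[0.9, 0.2],
              [0.6, 0.4]])
T = np.full_like(P, 0.5)
B = differentiable_binarization(P, T)
# Pixels well above the threshold map toward 1, those below toward 0.
```

Because the sigmoid keeps gradients flowing through both P and T, the network can tighten thresholds near text boundaries instead of relying on a single fixed global cutoff at inference time.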
Pages: 16
相关论文
共 50 条
  • [21] Multimodal detection of hateful memes by applying a vision-language pre-training model
    Chen, Yuyang
    Pan, Feng
    PLOS ONE, 2022, 17 (09):
  • [22] Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training
    Radenovic, Filip
    Dubey, Abhimanyu
    Kadian, Abhishek
    Mihaylov, Todor
    Vandenhende, Simon
    Patel, Yash
    Wen, Yi
    Ramanathan, Vignesh
    Mahajan, Dhruv
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 6967 - 6977
  • [23] Transferable Multimodal Attack on Vision-Language Pre-training Models
    Wang, Haodi
    Dong, Kai
    Zhu, Zhilei
    Qin, Haotong
    Liu, Aishan
    Fang, Xiaolin
    Wang, Jiakai
    Liu, Xianglong
    45TH IEEE SYMPOSIUM ON SECURITY AND PRIVACY, SP 2024, 2024, : 1722 - 1740
  • [24] Enhancing Dynamic Image Advertising with Vision-Language Pre-training
    Wen, Zhoufutu
    Zhao, Xinyu
    Jin, Zhipeng
    Yang, Yi
    Jia, Wei
    Chen, Xiaodong
    Li, Shuanglong
    Liu, Lin
    PROCEEDINGS OF THE 46TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2023, 2023, : 3310 - 3314
  • [25] Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision
    Wang, Tzu-Jui Julius
    Laaksonen, Jorma
    Langer, Tomas
    Arponen, Heikki
    Bishop, Tom E.
    2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2023, : 1073 - 1083
  • [26] Superpixel semantics representation and pre-training for vision-language tasks
    Zhang, Siyu
    Chen, Yeming
    Sun, Yaoru
    Wang, Fang
    Yang, Jun
    Bai, Lizhi
    Gao, Shangce
    NEUROCOMPUTING, 2025, 615
  • [27] Too Large; Data Reduction for Vision-Language Pre-Training
    Wang, Alex Jinpeng
    Lin, Kevin Qinghong
    Zhang, David Junhao
    Lei, Stan Weixian
    Shou, Mike Zheng
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 3124 - 3134
  • [28] Scaling Up Vision-Language Pre-training for Image Captioning
    Hu, Xiaowei
    Gan, Zhe
    Wang, Jianfeng
    Yang, Zhengyuan
    Liu, Zicheng
    Lu, Yumao
    Wang, Lijuan
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 17959 - 17968
  • [29] Towards Adversarial Attack on Vision-Language Pre-training Models
    Zhang, Jiaming
    Yi, Qi
    Sang, Jitao
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 5005 - 5013
  • [30] MAFA: Managing False Negatives for Vision-Language Pre-training
    Byun, Jaeseok
    Kim, Dohoon
    Moon, Taesup
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 27304 - 27314