Efficient Medical Images Text Detection with Vision-Language Pre-training Approach

Cited by: 0
Authors
Li, Tianyang [1 ,2 ]
Bai, Jinxu [1 ]
Wang, Qingzhu [1 ]
Xu, Hanwen [1 ]
Affiliations
[1] Northeast Elect Power Univ, Comp Sci, Jilin, Peoples R China
[2] Jiangxi New Energy Technol Inst, Nanchang, Jiangxi, Peoples R China
Keywords
vision-language pre-training; medical text detection; feature enhancement; differentiable binarization;
DOI
Not available
CLC Number
TP18 [theory of artificial intelligence];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Text detection in medical images is a critical task, essential for automating the extraction of valuable information from diverse healthcare documents. Conventional text detection methods, predominantly segmentation-based, face substantial challenges with text-rich images, extreme aspect ratios, and multi-oriented text. To address these complexities, this paper introduces a text detection system designed to improve detection efficacy. The proposed system comprises two core components of the segmentation head: the Efficient Feature Enhancement Module (EFEM) and the Multi-Scale Feature Fusion Module (MSFM). The EFEM incorporates a spatial attention mechanism that injects multi-level information to improve segmentation performance. The MSFM merges EFEM features from different depths and scales to produce the final segmentation features. Complementing the segmentation stage, the post-processing module employs a differentiable binarization technique, which adaptively adjusts the threshold to improve text detection precision. To further strengthen accuracy and robustness, a vision-language pre-training model is integrated: pre-trained on large-scale vision-language understanding tasks, it accumulates rich visual and semantic representations, and when combined with the segmentation module it effectively leverages these representational capabilities. The proposed model is rigorously evaluated on medical text image datasets and consistently demonstrates strong performance; benchmark experiments confirm its efficacy.
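The differentiable binarization mentioned in the abstract commonly follows the formulation B = 1 / (1 + exp(-k(P - T))), where P is the predicted probability map, T a learned per-pixel threshold map, and k an amplification factor. A minimal NumPy sketch of this idea (the function name, map values, and k = 50 are illustrative, not taken from the paper):

```python
import numpy as np

def differentiable_binarization(prob_map, thresh_map, k=50.0):
    """Soft binarization B = 1 / (1 + exp(-k * (P - T))).

    Unlike a hard step function, this sigmoid form is differentiable,
    so the threshold map T can be trained jointly with the probability
    map P; at a large k the output approaches a hard 0/1 mask.
    """
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))

# Toy example: pixels well above the threshold saturate toward 1,
# pixels well below it toward 0; k controls transition sharpness.
P = np.array([[0.9, 0.2],
              [0.55, 0.5]])
T = np.full_like(P, 0.5)   # a uniform threshold map for illustration
B = differentiable_binarization(P, T)
```

Because the approximate binarization stays differentiable everywhere, gradients flow through the threshold map during training, which is what enables the adaptive threshold adjustment described above.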
Pages: 16