Vision language model for interpretable and fine-grained detection of safety compliance in diverse workplaces

Cited: 0
Authors
Chen, Zhiling [1]
Chen, Hanning [2]
Imani, Mohsen [2]
Chen, Ruimin [1]
Imani, Farhad [1]
Affiliations
[1] Univ Connecticut, Sch Mech Aerosp & Mfg Engn, Storrs, CT 06269 USA
[2] Univ Calif Irvine, Dept Comp Sci, Irvine, CA USA
Funding
U.S. National Science Foundation
Keywords
Personal protective equipment; Zero-shot object detection; Vision language model; Large language model; CONSTRUCTION; IDENTIFICATION;
DOI
10.1016/j.eswa.2024.125769
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Workplace accidents caused by non-compliance with personal protective equipment (PPE) requirements raise serious safety concerns and lead to legal liabilities, financial penalties, and reputational damage. While object detection models can address this issue by identifying safety gear, most existing models, such as YOLO, Faster R-CNN, and SSD, are limited in verifying the fine-grained attributes of PPE across diverse workplace scenarios. Vision language models (VLMs) are gaining traction for detection tasks by leveraging the synergy between visual and textual information, offering a promising alternative to traditional object detection for PPE recognition. Nonetheless, VLMs face challenges in consistently verifying PPE attributes because the complexity and variability of workplace environments require them to interpret context-specific language and visual cues simultaneously. We introduce Clip2Safety, an interpretable detection framework for safety compliance in diverse workplaces, which comprises four main modules: scene recognition, visual prompt, safety gear detection, and fine-grained verification. Scene recognition identifies the current scenario to determine the necessary safety gear. Visual prompt formulates the specific visual cues needed for the detection process. Safety gear detection determines whether the required safety gear is being worn in the identified scenario. Lastly, fine-grained verification assesses whether the worn safety equipment meets the fine-grained attribute requirements. We conduct real-world case studies across six different scenarios. The results show that Clip2Safety not only improves accuracy over state-of-the-art question-answering-based VLMs but also achieves inference times that are 21 times faster.
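The abstract outlines the four-module pipeline but not its implementation. The sketch below shows how such a pipeline could be approximated with off-the-shelf CLIP zero-shot classification via Hugging Face transformers; the scene labels, prompt strings, and the REQUIRED_GEAR mapping are illustrative assumptions, not the authors' actual prompts, backbones, or thresholds.

```python
# Minimal sketch of a Clip2Safety-style pipeline using off-the-shelf CLIP
# zero-shot classification. All prompts and the scene-to-gear mapping are
# hypothetical stand-ins for the paper's (unpublished-in-abstract) details.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_scores(image: Image.Image, prompts: list[str]) -> torch.Tensor:
    """Return softmax probabilities of each prompt matching the image."""
    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, len(prompts))
    return logits.softmax(dim=-1).squeeze(0)

# Module 1: scene recognition -- pick the workplace scenario.
SCENES = ["a construction site", "a chemical laboratory", "a welding workshop"]

# Module 2: visual prompt -- scenario-specific gear with paired
# compliant/non-compliant attribute cues (hypothetical mapping).
REQUIRED_GEAR = {
    "a construction site": {
        "hard hat": ("a worker wearing a properly fastened hard hat",
                     "a worker with a missing or unfastened hard hat"),
        "safety vest": ("a worker wearing a high-visibility safety vest",
                        "a worker without a high-visibility vest"),
    },
}

def check_compliance(image_path: str) -> dict:
    image = Image.open(image_path).convert("RGB")
    scene = SCENES[clip_scores(image, SCENES).argmax().item()]
    report = {"scene": scene, "gear": {}}
    # Modules 3 & 4: gear detection and fine-grained attribute verification,
    # both cast as binary zero-shot comparisons of positive/negative prompts.
    for gear, (positive, negative) in REQUIRED_GEAR.get(scene, {}).items():
        probs = clip_scores(image, [positive, negative])
        report["gear"][gear] = {"compliant": bool(probs[0] > probs[1]),
                                "confidence": float(probs.max())}
    return report

print(check_compliance("worker.jpg"))
```

Casting both gear detection and fine-grained verification as paired positive/negative prompt comparisons is one way to obtain the attribute-level checks the abstract describes while keeping each decision interpretable as a prompt-match score.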
Pages: 15