Vision language model for interpretable and fine-grained detection of safety compliance in diverse workplaces

Cited: 0
Authors
Chen, Zhiling [1]
Chen, Hanning [2]
Imani, Mohsen [2]
Chen, Ruimin [1]
Imani, Farhad [1]
Affiliations
[1] Univ Connecticut, Sch Mech Aerosp & Mfg Engn, Storrs, CT 06269 USA
[2] Univ Calif Irvine, Dept Comp Sci, Irvine, CA USA
Funding
U.S. National Science Foundation
Keywords
Personal protective equipment; Zero-shot object detection; Vision language model; Large language model; CONSTRUCTION; IDENTIFICATION;
DOI
10.1016/j.eswa.2024.125769
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Workplace accidents caused by non-compliance with personal protective equipment (PPE) requirements raise serious safety concerns and lead to legal liabilities, financial penalties, and reputational damage. While object detection models can address this issue by identifying safety gear, most existing models, such as YOLO, Faster R-CNN, and SSD, are limited in verifying the fine-grained attributes of PPE across diverse workplace scenarios. Vision language models (VLMs) are gaining traction for detection tasks by leveraging the synergy between visual and textual information, offering a promising alternative to traditional object detection for PPE recognition. Nonetheless, VLMs face challenges in consistently verifying PPE attributes because the complexity and variability of workplace environments require them to interpret context-specific language and visual cues simultaneously. We introduce Clip2Safety, an interpretable detection framework for safety compliance in diverse workplaces, which comprises four main modules: scene recognition, visual prompt, safety gear detection, and fine-grained verification. Scene recognition identifies the current scenario to determine the necessary safety gear. Visual prompt formulates the specific visual cues needed for the detection process. Safety gear detection determines whether the required safety gear is being worn in the identified scenario. Lastly, fine-grained verification assesses whether the worn safety equipment meets the fine-grained attribute requirements. We conduct real-world case studies across six different scenarios. The results show that Clip2Safety not only improves accuracy over state-of-the-art question-answering-based VLMs but also achieves inference times that are 21 times faster.
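The abstract outlines the four-module pipeline but not its implementation. The sketch below shows how such a pipeline could be approximated with off-the-shelf CLIP zero-shot classification via Hugging Face transformers; the scene labels, prompt strings, and the REQUIRED_GEAR mapping are illustrative assumptions, not the authors' actual prompts, backbones, or thresholds.

```python
# Minimal sketch of a Clip2Safety-style pipeline using off-the-shelf CLIP
# zero-shot classification. All prompts and the scene-to-gear mapping are
# hypothetical stand-ins for the paper's (unpublished-in-abstract) details.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_scores(image: Image.Image, prompts: list[str]) -> torch.Tensor:
    """Return softmax probabilities of each prompt matching the image."""
    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, len(prompts))
    return logits.softmax(dim=-1).squeeze(0)

# Module 1: scene recognition -- pick the workplace scenario.
SCENES = ["a construction site", "a chemical laboratory", "a welding workshop"]

# Module 2: visual prompt -- scenario-specific gear with paired
# compliant/non-compliant attribute cues (hypothetical mapping).
REQUIRED_GEAR = {
    "a construction site": {
        "hard hat": ("a worker wearing a properly fastened hard hat",
                     "a worker with a missing or unfastened hard hat"),
        "safety vest": ("a worker wearing a high-visibility safety vest",
                        "a worker without a high-visibility vest"),
    },
}

def check_compliance(image_path: str) -> dict:
    image = Image.open(image_path).convert("RGB")
    scene = SCENES[clip_scores(image, SCENES).argmax().item()]
    report = {"scene": scene, "gear": {}}
    # Modules 3 & 4: gear detection and fine-grained attribute verification,
    # both cast as binary zero-shot comparisons of positive/negative prompts.
    for gear, (positive, negative) in REQUIRED_GEAR.get(scene, {}).items():
        probs = clip_scores(image, [positive, negative])
        report["gear"][gear] = {"compliant": bool(probs[0] > probs[1]),
                                "confidence": float(probs.max())}
    return report

print(check_compliance("worker.jpg"))
```

Casting both gear detection and fine-grained verification as paired positive/negative prompt comparisons is one way to obtain the attribute-level checks the abstract describes while keeping each decision interpretable as a prompt-match score.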
Pages: 15