ChatMap: A Wearable Platform Based on the Multi-modal Foundation Model to Augment Spatial Cognition for People with Blindness and Low Vision

Cited by: 0
Authors
Hao, Yu [1 ,2 ]
Magay, Alexey [1 ,2 ]
Huang, Hao [1 ,2 ]
Yuan, Shuaihang [1 ,2 ]
Wen, Congcong [1 ,2 ]
Fang, Yi [1 ,2 ]
Affiliations
[1] NYU Tandon, Embodied AI & Robot AIR Lab, Brooklyn, NY 11201 USA
[2] NYU Abu Dhabi, Abu Dhabi, U Arab Emirates
Source
2024 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS 2024), 2024
Keywords
TECHNOLOGIES
DOI
10.1109/IROS58592.2024.10801606
Chinese Library Classification (CLC): TP [Automation Technology, Computer Technology]
Discipline Code: 0812
Abstract
Spatial cognition refers to the ability to acquire knowledge about one's surroundings and to use this information to identify one's location, acquire resources, and navigate back to familiar places. People with blindness and low vision (pBLV) face significant challenges with spatial cognition because it relies heavily on visual input. Without the full range of visual cues, pBLV individuals often find it difficult to form a comprehensive understanding of their environment, which hinders scene recognition and precise object localization, especially in unfamiliar environments. This limitation also affects their ability to independently detect and avoid potential tripping hazards, making navigation and interaction with their environment more challenging. In this paper, we present a pioneering wearable platform tailored to enhance the spatial cognition of pBLV through the integration of multi-modal foundation models. The proposed platform integrates a wearable camera with an audio module and leverages the advanced capabilities of vision-language foundation models (i.e., GPT-4 and GPT-4V) for the nuanced processing of visual and textual data. Specifically, we employ vision-language models to bridge the gap between visual information and the proprioception of visually impaired users, offering more intelligible guidance by aligning visual data with the natural perception of space and movement. We then apply prompt engineering to guide the large language model to act as an assistant tailored specifically to pBLV users and to produce accurate answers. Another innovation in our model is the incorporation of a chain-of-thought reasoning process, which enhances the accuracy and interpretability of the model and facilitates the generation of more precise responses to complex user inquiries across diverse environmental contexts. To assess the practical impact of the proposed wearable platform, we carried out a series of real-world experiments across three tasks that are commonly challenging for people with blindness and low vision: risk assessment, object localization, and scene recognition. Additionally, through an ablation study conducted on the VizWiz dataset, we rigorously assess the contribution of each individual module, substantiating its integral role in the model's overall performance.
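The abstract describes a concrete query pipeline: a wearable-camera frame plus a user question is routed to a GPT-4V-class vision-language model with a pBLV-tailored system prompt and chain-of-thought instructions. A minimal sketch of such a pipeline follows; the model name, prompt wording, and helper names are illustrative assumptions, not the authors' implementation.

# Minimal sketch of a ChatMap-style query pipeline, assuming the official
# OpenAI Python SDK and a GPT-4V-class model ("gpt-4o" is a stand-in here).
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are an assistant for a blind or low-vision user wearing a camera. "
    "Reason step by step about the scene (chain of thought), then give a short, "
    "spatially grounded answer using egocentric directions (left/right, clock "
    "positions) and approximate distances."
)

def encode_frame(path: str) -> str:
    """Base64-encode a captured camera frame for the vision-language model."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def ask_chatmap(frame_path: str, question: str) -> str:
    """Send one frame and one user question; return the model's guidance text."""
    frame_b64 = encode_frame(frame_path)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"},
                    },
                ],
            },
        ],
    )
    return response.choices[0].message.content

# Example query for one of the three evaluated tasks (risk assessment):
# print(ask_chatmap("frame.jpg", "Are there any tripping hazards on my path ahead?"))

The same call would serve the object-localization and scene-recognition tasks by changing only the question text; in the described platform, the audio module would then speak the returned guidance to the user.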
Pages: 129-134
Number of pages: 6