ChatMap: A Wearable Platform Based on the Multi-modal Foundation Model to Augment Spatial Cognition for People with Blindness and Low Vision

Cited by: 0
Authors
Hao, Yu [1 ,2 ]
Magay, Alexey [1 ,2 ]
Huang, Hao [1 ,2 ]
Yuan, Shuaihang [1 ,2 ]
Wen, Congcong [1 ,2 ]
Fang, Yi [1 ,2 ]
Affiliations
[1] NYU Tandon, Embodied AI & Robot AIR Lab, Brooklyn, NY 11201 USA
[2] NYU Abu Dhabi, Abu Dhabi, U Arab Emirates
Source
2024 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS 2024), 2024
Keywords
TECHNOLOGIES
DOI
10.1109/IROS58592.2024.10801606
Chinese Library Classification (CLC): TP [Automation Technology, Computer Technology]
Discipline Code: 0812
Abstract
Spatial cognition refers to the ability to acquire knowledge about one's surroundings and to use this information to identify one's location, acquire resources, and navigate back to familiar places. People with blindness and low vision (pBLV) face significant challenges with spatial cognition because it relies heavily on visual input. Without the full range of visual cues, pBLV individuals often find it difficult to form a comprehensive understanding of their environment, which hinders scene recognition and precise object localization, especially in unfamiliar environments. This limitation also affects their ability to independently detect and avoid potential tripping hazards, making navigation and interaction with their environment more challenging. In this paper, we present a pioneering wearable platform tailored to enhance the spatial cognition of pBLV through the integration of multi-modal foundation models. The proposed platform integrates a wearable camera with an audio module and leverages the advanced capabilities of vision-language foundation models (i.e., GPT-4 and GPT-4V) for the nuanced processing of visual and textual data. Specifically, we employ vision-language models to bridge the gap between visual information and the proprioception of visually impaired users, offering more intelligible guidance by aligning visual data with the natural perception of space and movement. We then apply prompt engineering to guide the large language model to act as an assistant tailored specifically to pBLV users and to produce accurate answers. Another innovation in our model is the incorporation of a chain-of-thought reasoning process, which enhances the accuracy and interpretability of the model and facilitates the generation of more precise responses to complex user inquiries across diverse environmental contexts. To assess the practical impact of the proposed wearable platform, we carried out a series of real-world experiments across three tasks that are commonly challenging for people with blindness and low vision: risk assessment, object localization, and scene recognition. Additionally, through an ablation study conducted on the VizWiz dataset, we rigorously assess the contribution of each individual module, substantiating its integral role in the model's overall performance.
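The abstract describes a concrete query pipeline: a wearable-camera frame plus a user question is routed to a GPT-4V-class vision-language model with a pBLV-tailored system prompt and chain-of-thought instructions. A minimal sketch of such a pipeline follows; the model name, prompt wording, and helper names are illustrative assumptions, not the authors' implementation.

# Minimal sketch of a ChatMap-style query pipeline, assuming the official
# OpenAI Python SDK and a GPT-4V-class model ("gpt-4o" is a stand-in here).
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are an assistant for a blind or low-vision user wearing a camera. "
    "Reason step by step about the scene (chain of thought), then give a short, "
    "spatially grounded answer using egocentric directions (left/right, clock "
    "positions) and approximate distances."
)

def encode_frame(path: str) -> str:
    """Base64-encode a captured camera frame for the vision-language model."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def ask_chatmap(frame_path: str, question: str) -> str:
    """Send one frame and one user question; return the model's guidance text."""
    frame_b64 = encode_frame(frame_path)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"},
                    },
                ],
            },
        ],
    )
    return response.choices[0].message.content

# Example query for one of the three evaluated tasks (risk assessment):
# print(ask_chatmap("frame.jpg", "Are there any tripping hazards on my path ahead?"))

The same call would serve the object-localization and scene-recognition tasks by changing only the question text; in the described platform, the audio module would then speak the returned guidance to the user.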
Pages: 129-134
Number of pages: 6