Enhancing Visual Information Extraction with Large Language Models Through Layout-Aware Instruction Tuning

Cited by: 0
Authors
Li, Teng [1 ]
Wang, Jiapeng [1 ]
Jin, Lianwen [1 ]
Affiliations
[1] South China University of Technology, Guangzhou, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
Visual Information Extraction; Large Language Model; Instruction Tuning;
DOI
10.1007/978-981-97-8511-7_20
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Recently, leveraging large language models (LLMs) for visually rich document information extraction has made significant progress. Previous studies have simplified visual information extraction into a document visual question answering task, in which each question-answer exchange yields a single entity, serving mainly to validate the document-understanding capabilities of LLMs. However, these methods incur substantial computational cost and inefficiency when multiple entities must be extracted from a single document, a scenario common in practical document digitization. This paper builds upon a large language model and incorporates document layout information through a document layout modeling branch. We also design a layout-aware, task-specific instruction set. To further strengthen the model's ability to learn document layout, we first augment the tokenizer's vocabulary and then fine-tune the entire model, so that it adapts to the expanded vocabulary and extracts document layout features effectively. By harnessing the exceptional language comprehension capabilities of LLMs, our model can perform comprehensive entity extraction over an entire document in a single pass. Owing to the generative nature of LLMs, a single model can also handle multiple downstream visual information extraction tasks. Experimental results demonstrate consistent improvements over the baseline model across a range of document visual information extraction tasks.
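As a rough illustration of the vocabulary-augmentation and layout-aware instruction ideas described in the abstract, the sketch below shows how such steps could look with the HuggingFace Transformers API. This is a minimal sketch, not the authors' method: the backbone name, the coordinate-bin token scheme, and the prompt format are illustrative assumptions, not details taken from the paper.

```python
# A minimal sketch, NOT the paper's released code: it illustrates augmenting a
# tokenizer's vocabulary with layout tokens and resizing the embedding matrix
# before full-model fine-tuning, using the HuggingFace Transformers API.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed backbone; the paper does not necessarily use this model.
BACKBONE = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(BACKBONE)
model = AutoModelForCausalLM.from_pretrained(BACKBONE)

# Hypothetical layout tokens: x/y coordinates quantized into 100 bins each,
# so every OCR line can be tagged with its bounding box in token form.
layout_tokens = [f"<x_{i}>" for i in range(100)] + [f"<y_{i}>" for i in range(100)]
tokenizer.add_tokens(layout_tokens, special_tokens=True)

# Give the new tokens trainable embedding rows; subsequent full-model
# fine-tuning lets these embeddings (and the rest of the network) learn
# layout semantics.
model.resize_token_embeddings(len(tokenizer))

# Hypothetical layout-aware instruction for single-pass, multi-entity
# extraction: the model sees the whole document once and returns all fields.
instruction = (
    "Each OCR line below is followed by its quantized bounding box. "
    "Extract ALL key-value entities from the document and output them as JSON.\n"
    "Invoice No: 12345 <x_10> <y_5> <x_42> <y_8>\n"
    "Date: 2024-01-01 <x_10> <y_12> <x_38> <y_15>\n"
)
```

Under these assumptions, a single forward generation over the full prompt would yield all entities at once, in contrast to the one-question-per-entity QA formulation the abstract argues against.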
Pages: 276-289 (14 pages)