STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models

Cited by: 0
Authors
Ma, Mingyu Derek [1 ]
Wang, Xiaoxuan [1 ]
Kung, Po-Nien [1 ]
Brantingham, P. Jeffrey [2 ]
Peng, Nanyun [1 ]
Wang, Wei [1 ]
Affiliations
[1] Univ Calif Los Angeles, Dept Comp Sci, Los Angeles, CA 90024 USA
[2] Univ Calif Los Angeles, Dept Anthropol, Los Angeles, CA USA
DOI: not available
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
Information extraction (IE) tasks such as event extraction require an in-depth understanding of the output structure and sub-task dependencies. They heavily rely on task-specific training data in the form of (passage, target structure) pairs to achieve reasonable performance. However, obtaining such data through human annotation is costly, creating a pressing need for low-resource IE approaches that require minimal human labeling for real-world applications. Fine-tuning supervised models with synthesized training data would be a generalizable method, but existing data generation methods either still rely on large-scale ground-truth data or cannot be applied to complicated IE tasks due to their poor performance. To address these challenges, we propose STAR, a data generation method that leverages Large Language Models (LLMs) to synthesize data instances from limited seed demonstrations, thereby boosting low-resource IE performance. Our approach first generates target structures (Y) and then generates passages (X), all with the aid of LLMs. We design fine-grained, step-by-step instructions to obtain the initial data instances, and further reduce errors and improve data quality through self-reflective error identification and self-refinement with iterative revision. Our experiments show that data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks, even surpassing the effectiveness of human-curated data. Human assessment of data quality shows that STAR-generated data exhibit higher passage quality and align better with task definitions than human-curated data.
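The structure-to-text pipeline the abstract describes (generate a target structure Y, then a passage X, then self-reflect and iteratively refine) can be sketched as below. This is a minimal illustration, not the paper's implementation: `call_llm` is a hypothetical stand-in for a real LLM API, and the prompt wording is invented for demonstration.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call (e.g., an OpenAI-style chat API).

    A real implementation would query a model; this stub returns a
    canned reply so the control flow can be exercised offline.
    """
    return f"[LLM output for: {prompt[:40]}...]"


def generate_instance(task_definition: str, seed_demos: list[str],
                      max_revisions: int = 2) -> tuple[str, str]:
    """Synthesize one (passage X, target structure Y) training instance."""
    # Step 1: generate a target structure Y from the task definition
    # and a few seed demonstrations.
    structure = call_llm(
        f"Task: {task_definition}\nDemos: {seed_demos}\n"
        "Generate a plausible target structure:"
    )
    # Step 2: generate a passage X that expresses the structure.
    passage = call_llm(f"Write a passage expressing this structure:\n{structure}")
    # Steps 3-4: self-reflective error identification followed by
    # self-refinement, repeated for a fixed number of revision rounds.
    for _ in range(max_revisions):
        errors = call_llm(
            "List mismatches between the passage and the structure:\n"
            f"{passage}\n{structure}"
        )
        passage = call_llm(
            f"Revise the passage to fix these errors:\n{errors}\n{passage}"
        )
    return passage, structure
```

The resulting (X, Y) pairs would then serve as synthetic training data for fine-tuning a supervised extraction model.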
Pages: 18751-18759 (9 pages)