STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models

Citations: 0
Authors
Ma, Mingyu Derek [1]
Wang, Xiaoxuan [1]
Kung, Po-Nien [1]
Brantingham, P. Jeffrey [2]
Peng, Nanyun [1]
Wang, Wei [1]
Affiliations
[1] Univ Calif Los Angeles, Dept Comp Sci, Los Angeles, CA 90024 USA
[2] Univ Calif Los Angeles, Dept Anthropol, Los Angeles, CA USA
Keywords: (none listed)
DOI: Not available
CLC classification: TP18 [Theory of artificial intelligence]
Subject classification codes: 081104; 0812; 0835; 1405
Abstract
Information extraction (IE) tasks such as event extraction require an in-depth understanding of the output structure and sub-task dependencies. They rely heavily on task-specific training data in the form of (passage, target structure) pairs to reach reasonable performance. However, obtaining such data through human annotation is costly, creating a pressing need for low-resource IE approaches that require minimal human labeling in real-world applications. Fine-tuning supervised models on synthesized training data would be a generalizable solution, but existing data generation methods either still rely on large-scale ground-truth data or perform too poorly to handle complicated IE tasks. To address these challenges, we propose STAR, a data generation method that leverages Large Language Models (LLMs) to synthesize data instances from limited seed demonstrations, thereby boosting low-resource IE performance. Our approach first generates target structures (Y) and then generates passages (X), both with the aid of LLMs. We design fine-grained, step-by-step instructions to obtain the initial data instances, and we further reduce errors and improve data quality through self-reflection error identification and self-refinement with iterative revision. Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks, even surpassing the effectiveness of human-curated data. Human assessment shows that STAR-generated data exhibit higher passage quality and align better with task definitions than human-curated data.
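The abstract sketches a structure-first pipeline: synthesize a target structure Y, generate a passage X that expresses it, then identify and fix errors through self-reflection and iterative revision. Below is a minimal Python sketch of that flow under stated assumptions; the `complete` helper, all prompts, and all function names are hypothetical stand-ins, not the authors' released implementation.

```python
# Minimal sketch of a STAR-style structure-to-text data generation loop.
# The `complete` helper, prompts, and function names are hypothetical
# illustrations of the pipeline the abstract describes, not the authors' code.
import json
from dataclasses import dataclass


@dataclass
class Instance:
    structure: dict  # target structure Y (e.g., event type, trigger, arguments)
    passage: str     # passage X expressing that structure


def complete(prompt: str) -> str:
    """Placeholder for an LLM completion call (any provider would do)."""
    raise NotImplementedError


def generate_structure(seed_demos: list[dict], task_def: str) -> dict:
    """Step 1: synthesize a new target structure Y from seed demonstrations."""
    prompt = (
        f"Task definition:\n{task_def}\n\n"
        f"Example target structures:\n{json.dumps(seed_demos, indent=2)}\n\n"
        "Produce one new, distinct target structure as JSON:"
    )
    return json.loads(complete(prompt))


def generate_passage(structure: dict, task_def: str) -> str:
    """Step 2: write a passage X that expresses the structure Y."""
    prompt = (
        f"Task definition:\n{task_def}\n\n"
        f"Write a short passage that expresses exactly this structure:\n"
        f"{json.dumps(structure)}"
    )
    return complete(prompt)


def refine(inst: Instance, task_def: str, max_rounds: int = 2) -> Instance:
    """Steps 3-4: self-reflection error identification, then iterative revision."""
    for _ in range(max_rounds):
        critique = complete(
            f"Task definition:\n{task_def}\n"
            f"Structure: {json.dumps(inst.structure)}\n"
            f"Passage: {inst.passage}\n"
            "List mismatches between the passage and the structure, or reply NONE:"
        )
        if critique.strip().upper() == "NONE":
            break  # self-reflection found no remaining errors
        inst.passage = complete(
            f"Revise the passage to fix these issues:\n{critique}\n\n"
            f"Structure: {json.dumps(inst.structure)}\nPassage: {inst.passage}"
        )
    return inst
```

In this reading, the resulting (passage, structure) pairs would serve as silver training data for fine-tuning a supervised extraction model, which is how the abstract frames the downstream use.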
Pages: 18751-18759 (9 pages)