A Method for Efficient Structured Data Generation with Large Language Models

Authors
Hou, Zongzhi [1]
Zhao, Ruohan [1]
Li, Zhongyang [1]
Wang, Zheng [1]
Wu, Yizhen [1]
Gou, Junwei [1]
Zhu, Zhifeng [1]
Affiliation
[1] Huawei, Shanghai, People's Republic of China
Keywords
Multi-modality; Data Generation; Artificial Intelligence; Large Language Model
DOI
10.1145/3688866.3689127
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
With the rapid advancement of large language model technology, the data used to train these models has become increasingly important. The quality of text samples produced by large unsupervised models is often inadequate, which leads to unsatisfactory results. This inadequacy arises from the model's limited capacity to precisely emulate the underlying structure of the data without direct supervision, so its outputs may lack the necessary fidelity and relevance to the authentic data distribution. To overcome these shortcomings in training data generation for specific language generation tasks, this paper proposes the Fast Data Generation System (FDGS), which can handle multi-modal and structured data generation. FDGS uses clustering abstraction to handle multiple input data types through templates. This approach allows for quick data generation and reduces consumption. FDGS is robust, ensuring stable and reliable performance under various conditions. It is more cost-effective in terms of token usage than traditional methods that work on a per-instance basis and do not use templates. By abstracting and clustering different input types, FDGS can efficiently generate data from large models. The system is highly adaptable, which makes it well suited to multi-modal data generation tasks. It relies on the basic functions of general-purpose large language models and employs a query-answer bidirectional generation mechanism to achieve fast data amplification.
Pages: 36-44
Number of pages: 9