A Method for Efficient Structured Data Generation with Large Language Models

被引:0
|
作者
Hou, Zongzhi [1 ]
Zhao, Ruohan [1 ]
Li, Zhongyang [1 ]
Wang, Zheng [1 ]
Wu, Yizhen [1 ]
Gou, Junwei [1 ]
Zhu, Zhifeng [1 ]
机构
[1] Huawei, Shanghai, Peoples R China
关键词
Multi-modality; Data Generation; Artificial Intelligence; Large Language Model;
D O I
10.1145/3688866.3689127
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
With the rapid advancement of large language model technology, the data utilized for training these models has become increasingly significant. The quality of text data samples produced by large unsupervised models is often inadequate, leading to insufficient outcomes. This inadequacy arises from the model's constrained capacity to precisely emulate the underlying structure of the data without direct supervision, resulting in outputs that may lack the necessary fidelity and relevance to the authentic data distribution. In order to overcome the shortcomings of training data generation for specific language generation tasks, this paper proposes a fast data generation system (Fast Data Generation System, FDGS) that can handle multi-modal and structured data generation. As a method for generating data, FDGS uses clustering abstraction to handle multiple data input types through templates. This approach allows for quick data generation and reduces consumption. FDGS is robust, ensuring stable and reliable performance under various conditions. It is more cost-effective in terms of token usage compared to traditional methods that work on a per-instance basis and do not use templates. By abstracting and clustering different input types, FDGS can efficiently generate data from large models. This system is highly adaptable, making it a great choice for multi-modal data generation tasks. It relies on the basic functions of general large-scale language models and employs a query-answer bidirectional generation mechanism to achieve fast data amplification.
引用
收藏
页码:36 / 44
页数:9
相关论文
共 50 条
  • [31] MapReduce algorithms for efficient generation of CPS models from large historical data sets
    Windmann, Stefan
    Niggemann, Oliver
    PROCEEDINGS OF 2015 IEEE 20TH CONFERENCE ON EMERGING TECHNOLOGIES & FACTORY AUTOMATION (ETFA), 2015,
  • [32] Demystifying Data Management for Large Language Models
    Miao, Xupeng
    Jia, Zhihao
    Cui, Bin
    COMPANION OF THE 2024 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, SIGMOD-COMPANION 2024, 2024, : 547 - 555
  • [33] Bias and Cyberbullying Detection and Data Generation Using Transformer Artificial Intelligence Models and Top Large Language Models
    Kumar, Yulia
    Huang, Kuan
    Perez, Angelo
    Yang, Guohao
    Li, J. Jenny
    Morreale, Patricia
    Kruger, Dov
    Jiang, Raymond
    ELECTRONICS, 2024, 13 (17)
  • [34] Evaluating Large Language Models on Controlled Generation Tasks
    Sun, Jiao
    Tian, Yufei
    Zhou, Wangchunshu
    Xu, Nan
    Hu, Qian
    Gupta, Rahul
    Wieting, John
    Peng, Nanyun
    Ma, Xuezhe
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 3155 - 3168
  • [35] On the Evaluation of Large Language Models in Unit Test Generation
    Yang, Lin
    Yang, Chen
    Gao, Shutao
    Wang, Weijing
    Wang, Bo
    Zhu, Qihao
    Chu, Xiao
    Zhou, Jianyi
    Liang, Guangtai
    Wang, Qianxiang
    Chen, Junjie
    Proceedings - 2024 39th ACM/IEEE International Conference on Automated Software Engineering, ASE 2024, : 1607 - 1619
  • [36] Question Generation Capabilities of "Small" Large Language Models
    Berger, Joshua
    Koss, Jonathan
    Stamatakis, Markos
    Hoppe, Anett
    Ewerth, Ralph
    Wartenal, Christian
    NATURAL LANGUAGE PROCESSING AND INFORMATION SYSTEMS, PT II, NLDB 2024, 2024, 14763 : 183 - 194
  • [37] Bootstrapping Large Language Models for Radiology Report Generation
    Liu, Chang
    Tian, Yuanhe
    Chen, Weidong
    Song, Yan
    Zhang, Yongdong
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 17, 2024, : 18635 - 18643
  • [38] StableYolo: Optimizing Image Generation for Large Language Models
    Berger, Harel
    Dakhama, Aidan
    Ding, Zishuo
    Even-Mendoza, Karine
    Kelly, David
    Menendez, Hector
    Moussa, Rebecca
    Sarro, Federica
    SEARCH-BASED SOFTWARE ENGINEERING, SSBSE 2023, 2024, 14415 : 133 - 139
  • [39] An efficient method for transient data recovery from large DOF FE models
    Johal, RS
    IMAC-XVIII: A CONFERENCE ON STRUCTURAL DYNAMICS, VOLS 1 AND 2, PROCEEDINGS, 2000, 4062 : 208 - 212
  • [40] CONCEPTUAL DESIGN GENERATION USING LARGE LANGUAGE MODELS
    Ma, Kevin
    Grandi, Daniele
    McComb, Christopher
    Goucher-Lambert, Kosa
    PROCEEDINGS OF ASME 2023 INTERNATIONAL DESIGN ENGINEERING TECHNICAL CONFERENCES AND COMPUTERS AND INFORMATION IN ENGINEERING CONFERENCE, IDETC-CIE2023, VOL 6, 2023,