A Method for Efficient Structured Data Generation with Large Language Models

Cited by: 0
Authors
Hou, Zongzhi [1]
Zhao, Ruohan [1]
Li, Zhongyang [1]
Wang, Zheng [1]
Wu, Yizhen [1]
Gou, Junwei [1]
Zhu, Zhifeng [1]
Affiliation
[1] Huawei, Shanghai, People's Republic of China
Keywords
Multi-modality; Data Generation; Artificial Intelligence; Large Language Model
DOI
10.1145/3688866.3689127
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
With the rapid advancement of large language model technology, the data used to train these models has become increasingly important. Text samples produced by large unsupervised models are often of inadequate quality, which leads to unsatisfactory results. The inadequacy arises from the model's limited capacity to emulate the underlying structure of the data without direct supervision, so its outputs may lack fidelity and relevance to the authentic data distribution. To overcome these shortcomings in training data generation for specific language generation tasks, this paper proposes the Fast Data Generation System (FDGS), which handles multi-modal and structured data generation. FDGS uses clustering abstraction to handle multiple input data types through templates, which allows quick data generation and reduces resource consumption. FDGS is robust, with stable and reliable performance under various conditions, and it uses fewer tokens than traditional methods that work on a per-instance basis without templates. By abstracting and clustering different input types, FDGS can efficiently generate data from large models. The system is highly adaptable, which makes it well suited to multi-modal data generation tasks. It relies on the basic functions of general-purpose large language models and employs a query-answer bidirectional generation mechanism to achieve fast data amplification.
Pages: 36-44
Page count: 9
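
Illustrative sketch
The abstract states that FDGS clusters input types and generates data through templates, using fewer tokens than per-instance generation. The record gives no implementation details, so the following is only a minimal sketch of that general idea under stated assumptions, not the authors' method. The clustering key (a record's sorted field names), the function names, and the stub standing in for a language model are all hypothetical. The query-answer bidirectional mechanism is not modeled here.

```python
from collections import defaultdict
from typing import Callable


def signature(record: dict) -> tuple:
    # Hypothetical clustering key: the record's sorted field names.
    return tuple(sorted(record))


def cluster_by_structure(records: list[dict]) -> dict[tuple, list[dict]]:
    # Group records that share the same structure into one cluster.
    clusters: dict[tuple, list[dict]] = defaultdict(list)
    for record in records:
        clusters[signature(record)].append(record)
    return dict(clusters)


def generate(records: list[dict], llm: Callable[[tuple], str]) -> tuple[list[str], int]:
    # Ask the model for one template per cluster, then fill it locally.
    samples: list[str] = []
    calls = 0
    for fields, members in cluster_by_structure(records).items():
        template = llm(fields)  # one model call per cluster, not per record
        calls += 1
        samples.extend(template.format(**member) for member in members)
    return samples, calls


def stub_llm(fields: tuple) -> str:
    # Stand-in for a real model (assumption): one slot per field name.
    return "; ".join(f"{name} is {{{name}}}" for name in fields)


records = [
    {"city": "Shanghai", "temp": 21},
    {"city": "Oslo", "temp": 4},
    {"title": "FDGS", "pages": 9},
]
samples, calls = generate(records, stub_llm)
print(calls, samples)
```

In this toy run, three records fall into two clusters, so the stub is called twice instead of three times. The gap between model calls and records grows as more records share a structure, which is the kind of saving the abstract attributes to templates.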