Natural language generation from Universal Dependencies using data augmentation and pre-trained language models

Cited by: 0
Authors
Nguyen D.T. [1 ]
Tran T. [1 ]
Affiliations
[1] Saigon University, Ho Chi Minh City
Keywords
data augmentation; data-to-text generation; deep learning; fine-tune; pre-trained language models; sequence-to-sequence models; Universal Dependencies;
DOI
10.1504/IJIIDS.2023.10053426
Abstract
Natural language generation (NLG) has focused in recent years on data-to-text tasks with various kinds of structured input. The generated text should convey the given information, be grammatically correct, and meet other quality criteria. In this research we propose an approach that combines strong pre-trained language models with input data augmentation. The data studied in this work are Universal Dependencies (UDs), a framework developed for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) for cross-lingual learning. We study English UD structures, which are modified into two groups. In the first group, the modification removes the order information of each word and lemmatises the tokens. In the second group, the modification removes functional words and surface-oriented morphological details. With both groups of modified structures, we apply the same approach to explore how the pre-trained sequence-to-sequence models text-to-text transfer transformer (T5) and BART perform on the training data. We augment the training data by creating several permutations of each input structure. The results show that our approach can generate good-quality English text and suggest promising strategies for representing UD inputs. Copyright © 2023 Inderscience Enterprises Ltd.
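As a minimal sketch of the two core ideas in the abstract (order-free, lemmatised UD inputs and permutation-based augmentation), the Python snippet below shows how a dependency structure might be linearised into a plain-text source sequence and permuted into several training sources paired with the same target sentence. The token fields, markup, and the `linearize`/`augment` helpers are illustrative assumptions, not the paper's exact input representation.

```python
import itertools
import random

# A toy UD-style analysis of "The cats chased a mouse", reduced to
# (lemma, UPOS, deprel, head-lemma) tuples. Functional words and
# surface morphology are already dropped, as in the second group of
# modified structures described in the abstract (illustrative only).
UD_TOKENS = [
    ("cat",   "NOUN", "nsubj", "chase"),
    ("chase", "VERB", "root",  "ROOT"),
    ("mouse", "NOUN", "obj",   "chase"),
]

def linearize(tokens):
    """Flatten UD tokens into a plain-text source sequence.

    Word order is deliberately discarded and tokens are lemmatised,
    so any ordering of the tokens encodes the same structure.
    """
    return " ".join(
        f"{lemma} <{upos}> <{deprel}> {head}"
        for lemma, upos, deprel, head in tokens
    )

def augment(tokens, n_permutations=3, seed=0):
    """Create several permuted source sequences per input structure.

    Because the modified structures carry no order information, every
    permutation can be paired with the same target sentence, which
    multiplies the amount of training data.
    """
    rng = random.Random(seed)
    perms = list(itertools.permutations(tokens))
    rng.shuffle(perms)
    return [linearize(list(p)) for p in perms[:n_permutations]]

target = "The cats chased a mouse."
for source in augment(UD_TOKENS):
    print(f"{source}\t{target}")
```

The resulting (source, target) pairs would then feed an ordinary sequence-to-sequence fine-tuning loop for T5 or BART; that step is omitted here since the abstract does not specify the training configuration.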
Pages: 89-105
Page count: 16
Related papers
50 items in total
  • [31] μBERT: Mutation Testing using Pre-Trained Language Models
    Degiovanni, Renzo
    Papadakis, Mike
    2022 IEEE 15TH INTERNATIONAL CONFERENCE ON SOFTWARE TESTING, VERIFICATION AND VALIDATION WORKSHOPS (ICSTW 2022), 2022, : 160 - 169
  • [32] Devulgarization of Polish Texts Using Pre-trained Language Models
    Klamra, Cezary
    Wojdyga, Grzegorz
    Zurowski, Sebastian
    Rosalska, Paulina
    Kozlowska, Matylda
    Ogrodniczuk, Maciej
    COMPUTATIONAL SCIENCE, ICCS 2022, PT II, 2022, : 49 - 55
  • [33] Commonsense Knowledge Reasoning and Generation with Pre-trained Language Models: A Survey
    Bhargava, Prajjwal
    Ng, Vincent
THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 12317 - 12325
  • [34] Non-Autoregressive Text Generation with Pre-trained Language Models
    Su, Yixuan
    Cai, Deng
    Wang, Yan
    Vandyke, David
    Baker, Simon
    Li, Piji
    Collier, Nigel
    16TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2021), 2021, : 234 - 243
  • [35] LaoPLM: Pre-trained Language Models for Lao
    Lin, Nankai
    Fu, Yingwen
    Yang, Ziyu
    Chen, Chuwei
    Jiang, Shengyi
LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 6506 - 6512
  • [36] HinPLMs: Pre-trained Language Models for Hindi
    Huang, Xixuan
    Lin, Nankai
    Li, Kexin
    Wang, Lianxi
    Gan, Suifu
    2021 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2021, : 241 - 246
  • [37] Deciphering Stereotypes in Pre-Trained Language Models
    Ma, Weicheng
    Scheible, Henry
    Wang, Brian
    Veeramachaneni, Goutham
    Chowdhary, Pratim
    Sung, Alan
    Koulogeorge, Andrew
    Wang, Lili
    Yang, Diyi
    Vosoughi, Soroush
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023, : 11328 - 11345
  • [38] PhoBERT: Pre-trained language models for Vietnamese
    Dat Quoc Nguyen
    Anh Tuan Nguyen
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2020, 2020, : 1037 - 1042
  • [39] Knowledge Rumination for Pre-trained Language Models
    Yao, Yunzhi
    Wang, Peng
    Mao, Shengyu
    Tan, Chuanqi
    Huang, Fei
    Chen, Huajun
    Zhang, Ningyu
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 3387 - 3404
  • [40] Enhancing radiology report generation through pre-trained language models
    Leonardi, Giorgio
    Portinale, Luigi
    Santomauro, Andrea
PROGRESS IN ARTIFICIAL INTELLIGENCE, 2024