Data augmentation using virtual word insertion techniques in text classification tasks

Cited by: 1
Authors
Long, Zhigao [1, 2]
Li, Hong [1]
Shi, Jiawen [1]
Ma, Xin [1]
Affiliations
[1] Cent South Univ, Sch Comp Sci & Engn, Changsha, Peoples R China
[2] Cent South Univ, Sch Comp Sci & Engn, Changsha 410083, Hunan, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
class deviation factor; data augmentation; text classification; virtual word insertion techniques;
DOI
10.1111/exsy.13519
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
Labelling multiple training examples for text classification models is usually time-consuming and complex. Data augmentation can be used to automatically expand the dataset by transforming the original data. However, it may cause semantic changes without modifying the labels, which reduces the effectiveness of the classifiers. In this paper, we propose a data-augmentation method called the virtual word insertion technique, which generates new sentences by randomly inserting virtual words into existing sentences. Two methods are used to achieve virtual word embedding: unweighted average and weighted average. Furthermore, a new concept of weight is proposed: the class deviation factor, which demonstrates the correlation between words and classes. Based on this new concept, different weights are assigned to words of different classes. Experiments are conducted on five different classification tasks. Ablation experiments are also performed to explore the effects of the random operation and the number of augmented sentences on classification results. The results of these experiments show that our method improves the classification performance and outperforms two other contrasting data-augmentation methods in automatically augmenting the dataset. Compared to raw datasets, the average accuracy improvement of our method is 3.5% for a small-scale dataset and 1% for a large-scale dataset.
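The insertion step described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: that a virtual word's embedding is the (optionally weighted) average of the embeddings of the words already in the sentence, that the weights are supplied by the caller (the class deviation factor is not defined here, so no formula for it is attempted), and that insertion positions are drawn uniformly. The function names `virtual_word_embedding` and `insert_virtual_words` are hypothetical and are not taken from the paper.

```python
import random


def virtual_word_embedding(vectors, weights=None):
    """Return the average of the word vectors.

    Assumption: with weights=None this is the unweighted average;
    otherwise each vector is scaled by its caller-supplied weight
    (a stand-in for a per-word weight such as a class deviation factor).
    """
    if weights is None:
        weights = [1.0] * len(vectors)
    total = sum(weights)
    dim = len(vectors[0])
    return [
        sum(w * v[i] for w, v in zip(weights, vectors)) / total
        for i in range(dim)
    ]


def insert_virtual_words(vectors, n_insert=1, weights=None, seed=None):
    """Return a new sequence with n_insert virtual words at random positions.

    Assumption: positions are uniform over all gaps, including both ends.
    The input sequence is left unmodified.
    """
    rng = random.Random(seed)
    virtual = virtual_word_embedding(vectors, weights)
    augmented = [list(v) for v in vectors]
    for _ in range(n_insert):
        pos = rng.randint(0, len(augmented))  # inclusive upper bound
        augmented.insert(pos, list(virtual))
    return augmented


# Toy sentence of three 2-dimensional word embeddings.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
augmented = insert_virtual_words(sentence, n_insert=2, seed=0)
```

Because the label is left unchanged, each augmented sequence would be paired with the label of the sentence it was derived from; how the paper actually chooses positions and weights may differ from this sketch.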
Pages: 17
Related papers
50 in total
  • [1] Text Data Augmentation Techniques for Word Embeddings in Fake News Classification
    Kapusta, Jozef
    Drzik, David
    Steflovic, Kirsten
    Nagy, Kitti Szabo
    IEEE ACCESS, 2024, 12 : 31538 - 31550
  • [2] EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks
    Wei, Jason
    Zou, Kai
    2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019): PROCEEDINGS OF THE CONFERENCE, 2019, : 6382 - 6388
  • [3] Data Augmentation Methods for Enhancing Robustness in Text Classification Tasks
    Tang, Huidong
    Kamei, Sayaka
    Morimoto, Yasuhiko
    ALGORITHMS, 2023, 16 (01)
  • [4] Text Smoothing: Enhance Various Data Augmentation Methods on Text Classification Tasks
    Wu, Xing
    Gao, Chaochen
    Lin, Meng
    Zang, Liangjun
    Hu, Songlin
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022): (SHORT PAPERS), VOL 2, 2022, : 871 - 875
  • [5] Iterative Translation-Based Data Augmentation Method for Text Classification Tasks
    Lee, Sangwon
    Liu, Ling
    Choi, Wonik
    IEEE ACCESS, 2021, 9 : 160437 - 160445
  • [6] Word segmentation of handwritten text using supervised classification techniques
    Sun, Yi
    Butler, Timothy S.
    Shafarenko, Alex
    Adams, Rod
    Loomes, Martin
    Davey, Neil
    APPLIED SOFT COMPUTING, 2007, 7 (01) : 71 - 88
  • [7] Data Augmentation with Transformers for Text Classification
    Medardo Tapia-Tellez, Jose
    Jair Escalante, Hugo
    ADVANCES IN COMPUTATIONAL INTELLIGENCE, MICAI 2020, PT II, 2020, 12469 : 247 - 259
  • [8] A Survey on Data Augmentation for Text Classification
    Bayer, Markus
    Kaufhold, Marc-Andre
    Reuter, Christian
    ACM COMPUTING SURVEYS, 2023, 55 (07)
  • [9] Evaluation of data augmentation techniques on subjective tasks
    Gonzalez-Naharro, Luis
    Flores, M. Julia
    Martinez-Gomez, Jesus
    Puerta, Jose M.
    MACHINE VISION AND APPLICATIONS, 2024, 35 (04)
  • [10] EASY DATA AUGMENTATION METHOD FOR CLASSIFICATION TASKS
    Liu Guohang
    Zhang Shibin
    Tang Haozhe
    Yang Lu
    Lu Jiazhong
    Huang Yuanyuan
    2020 17TH INTERNATIONAL COMPUTER CONFERENCE ON WAVELET ACTIVE MEDIA TECHNOLOGY AND INFORMATION PROCESSING (ICCWAMTIP), 2020, : 166 - 169