Data augmentation using virtual word insertion techniques in text classification tasks

Cited by: 1
Authors
Long, Zhigao [1, 2]
Li, Hong [1]
Shi, Jiawen [1]
Ma, Xin [1]
Affiliations
[1] Cent South Univ, Sch Comp Sci & Engn, Changsha, Peoples R China
[2] Cent South Univ, Sch Comp Sci & Engn, Changsha 410083, Hunan, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
class deviation factor; data augmentation; text classification; virtual word insertion techniques
DOI
10.1111/exsy.13519
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Labelling many training examples for text classification models is usually time-consuming and complex. Data augmentation can expand a dataset automatically by transforming the original data. However, the transformation may change a sentence's meaning while its label stays the same, which reduces the effectiveness of the classifier. In this paper, we propose a data-augmentation method called the virtual word insertion technique, which generates new sentences by randomly inserting virtual words into existing sentences. Two methods are used to compute the virtual word embedding: unweighted averaging and weighted averaging. Furthermore, a new weighting concept is proposed: the class deviation factor, which captures the correlation between words and classes. Based on this concept, different weights are assigned to words of different classes. Experiments are conducted on five different classification tasks. Ablation experiments are also performed to explore the effects of the random operation and the number of augmented sentences on classification results. The results show that our method improves classification performance and outperforms two other data-augmentation methods used for comparison in automatically augmenting the dataset. Compared with the raw datasets, our method improves average accuracy by 3.5% on a small-scale dataset and by 1% on a large-scale dataset.
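The idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which words are averaged or how the class deviation factor is computed, so the sketch assumes the virtual word is an average of the sentence's own word vectors and treats the weights as a caller-supplied placeholder. The names `average_embedding` and `insert_virtual_words` are hypothetical.

```python
import random


def average_embedding(vectors, weights=None):
    """Return the average of equal-length vectors.

    With weights=None this is the unweighted average; otherwise each
    vector is scaled by its weight. The weights stand in for the class
    deviation factor, whose formula the abstract does not give.
    """
    if weights is None:
        weights = [1.0] * len(vectors)
    total = sum(weights)
    dim = len(vectors[0])
    return [
        sum(w * v[i] for w, v in zip(weights, vectors)) / total
        for i in range(dim)
    ]


def insert_virtual_words(embedded_sentence, virtual_vec, n_insert, rng):
    """Copy the sentence and insert virtual_vec at n_insert random positions."""
    augmented = list(embedded_sentence)
    for _ in range(n_insert):
        # Any gap is allowed, including before the first and after the last word.
        pos = rng.randint(0, len(augmented))
        augmented.insert(pos, virtual_vec)
    return augmented


# Toy 2-dimensional word vectors for a three-word sentence (made-up values).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

unweighted = average_embedding(sentence)
weighted = average_embedding(sentence, weights=[2.0, 1.0, 1.0])

rng = random.Random(0)  # fixed seed so the sketch is reproducible
augmented = insert_virtual_words(sentence, weighted, n_insert=2, rng=rng)
```

The augmented sentence keeps the original word vectors in their original order and only gains extra positions, so its label is reused unchanged.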
Pages: 17