Data augmentation using virtual word insertion techniques in text classification tasks

Cited by: 1
Authors
Long, Zhigao [1 ,2 ]
Li, Hong [1 ]
Shi, Jiawen [1 ]
Ma, Xin [1 ]
Affiliations
[1] Cent South Univ, Sch Comp Sci & Engn, Changsha, Peoples R China
[2] Cent South Univ, Sch Comp Sci & Engn, Changsha 410083, Hunan, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
class deviation factor; data augmentation; text classification; virtual word insertion techniques
DOI
10.1111/exsy.13519
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Labelling many training examples for text classification models is usually time-consuming and complex. Data augmentation can automatically expand a dataset by transforming the original data. However, a transformation may change the meaning of a sentence while its label stays unchanged, which reduces the effectiveness of the classifiers. In this paper, we propose a data-augmentation method called the virtual word insertion technique, which generates new sentences by randomly inserting virtual words into existing sentences. Two methods are used to compute the virtual word embedding: an unweighted average and a weighted average. Furthermore, a new concept of weight is proposed: the class deviation factor, which reflects the correlation between words and classes. Based on this concept, different weights are assigned to words of different classes. Experiments are conducted on five different classification tasks. Ablation experiments are also performed to explore the effects of the random operation and the number of augmented sentences on the classification results. The results show that our method improves classification performance and outperforms two other data-augmentation methods used for comparison. Compared to the raw datasets, the average accuracy improvement of our method is 3.5% for a small-scale dataset and 1% for a large-scale dataset.
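The two averaging variants named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the virtual word embedding is an average over the embeddings of the words already in the sentence, and that the new vector is inserted at a uniformly random position. The abstract does not define the class deviation factor, so the `weights` argument is only a hypothetical stand-in for per-word weights supplied by the caller. The function names `virtual_word_embedding` and `insert_virtual_word` are likewise illustrative.

```python
import random
from typing import List, Optional, Sequence

Vector = List[float]


def virtual_word_embedding(
    embeddings: Sequence[Vector],
    weights: Optional[Sequence[float]] = None,
) -> Vector:
    """Return the (weighted) average of a sentence's word vectors.

    With weights=None every word counts equally (unweighted average).
    Otherwise `weights` holds one non-negative number per word; in this
    sketch it is a placeholder for a class-dependent weight, whose
    formula is NOT given in the abstract.
    """
    if not embeddings:
        raise ValueError("sentence must contain at least one word vector")
    if weights is None:
        weights = [1.0] * len(embeddings)
    if len(weights) != len(embeddings):
        raise ValueError("need exactly one weight per word vector")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    dim = len(embeddings[0])
    return [
        sum(w * vec[i] for w, vec in zip(weights, embeddings)) / total
        for i in range(dim)
    ]


def insert_virtual_word(
    embeddings: Sequence[Vector],
    weights: Optional[Sequence[float]] = None,
    rng: Optional[random.Random] = None,
) -> List[Vector]:
    """Return a copy of the sentence with one virtual word vector inserted.

    The insertion position is drawn uniformly from all gaps, including
    before the first word and after the last word (an assumption).
    """
    rng = rng or random.Random()
    virtual = virtual_word_embedding(embeddings, weights)
    position = rng.randint(0, len(embeddings))  # both ends inclusive
    augmented = [list(vec) for vec in embeddings]
    augmented.insert(position, virtual)
    return augmented


if __name__ == "__main__":
    sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 2-d embeddings
    print(insert_virtual_word(sentence, rng=random.Random(0)))
    print(virtual_word_embedding(sentence, weights=[2.0, 1.0, 1.0]))
```

Because the label is left untouched and the inserted vector is built only from words already in the sentence, each call yields one augmented sample for the same class. Calling `insert_virtual_word` several times with different random states would produce several augmented sentences per original.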
Pages: 17
Related papers
50 in total
  • [41] Using Data Augmentation for Improving Text Summarization
    Constantin, Daniel
    Mihaescu, Marian Cristian
    Heras, Stella
    Jordan, Jaume
    Palanca, Javier
    Julian, Vicente
    INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING - IDEAL 2024, PT II, 2025, 15347 : 132 - 144
  • [42] Multiclass Classification for Bangla News Tags with Parallel CNN Using Word Level Data Augmentation
    Amin, Ruhul
    Sworna, Nabila Sabrin
    Hossain, Nahid
    2020 IEEE REGION 10 SYMPOSIUM (TENSYMP) - TECHNOLOGY FOR IMPACTFUL SUSTAINABLE DEVELOPMENT, 2020, : 174 - 177
  • [43] Substituting clinical features using synthetic medical phrases: Medical text data augmentation techniques
    Abdollahi, Mahdi
    Gao, Xiaoying
    Mei, Yi
    Ghosh, Shameek
    Li, Jinyan
    Narag, Michael
    Artificial Intelligence in Medicine, 2021, 120
  • [44] TAWC: Text Augmentation with Word Contributions for Imbalance Aspect-Based Sentiment Classification
    Santoso, Noviyanti
    Mendonca, Israel
    Aritsugi, Masayoshi
    APPLIED SCIENCES-BASEL, 2024, 14 (19):
  • [45] Substituting clinical features using synthetic medical phrases: Medical text data augmentation techniques
    Abdollahi, Mahdi
    Gao, Xiaoying
    Mei, Yi
    Ghosh, Shameek
    Li, Jinyan
    Narag, Michael
    ARTIFICIAL INTELLIGENCE IN MEDICINE, 2021, 120
  • [46] An analysis of hierarchical text classification using word embeddings
    Stein, Roger Alan
    Jaques, Patricia A.
    Valiati, Joao Francisco
    INFORMATION SCIENCES, 2019, 471 : 216 - 232
  • [47] Text classification using multi-word features
    Zhang, Wen
    Yoshida, Taketoshi
    Tang, Xijin
    2007 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN AND CYBERNETICS, VOLS 1-8, 2007, : 3740 - +
  • [48] Using WordNet to disambiguate word senses for text classification
    Liu, Ying
    Scheuermann, Peter
    Li, Xingsen
    Zhu, Xingquan
    COMPUTATIONAL SCIENCE - ICCS 2007, PT 3, PROCEEDINGS, 2007, 4489 : 781 - +
  • [49] Classification of Ovarian Cyst Using Regularized Convolution Neural Network with Data Augmentation Techniques
    Priya, N.
    Jeevitha, S.
    PROCEEDINGS OF SECOND INTERNATIONAL CONFERENCE ON SUSTAINABLE EXPERT SYSTEMS (ICSES 2021), 2022, 351 : 199 - 209
  • [50] Adversarial Word Dilution as Text Data Augmentation in Low-Resource Regime
    Chen, Junfan
    Zhang, Richong
    Luo, Zheyan
    Hu, Chunming
    Mao, Yongyi
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 11, 2023, : 12626 - 12634