Data augmentation using virtual word insertion techniques in text classification tasks

Cited by: 1
Authors
Long, Zhigao [1 ,2 ]
Li, Hong [1 ]
Shi, Jiawen [1 ]
Ma, Xin [1 ]
Affiliations
[1] Cent South Univ, Sch Comp Sci & Engn, Changsha, Peoples R China
[2] Cent South Univ, Sch Comp Sci & Engn, Changsha 410083, Hunan, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
class deviation factor; data augmentation; text classification; virtual word insertion techniques;
DOI
10.1111/exsy.13519
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Labelling large numbers of training examples for text classification models is usually time-consuming and complex. Data augmentation can automatically expand a dataset by transforming the original data. However, it may change the semantics of a sentence without changing its label, which reduces the effectiveness of the classifiers. In this paper, we propose a data-augmentation method called the virtual word insertion technique, which generates new sentences by randomly inserting virtual words into existing sentences. Two methods are used to compute the virtual word embedding: unweighted average and weighted average. Furthermore, a new concept of weight is proposed: the class deviation factor, which captures the correlation between words and classes. Based on this concept, different weights are assigned to words of different classes. Experiments are conducted on five different classification tasks. Ablation experiments are also performed to explore the effects of the random operation and the number of augmented sentences on classification results. The results show that our method improves classification performance and outperforms two other data-augmentation methods used for comparison. Compared to raw datasets, the average accuracy improvement of our method is 3.5% for a small-scale dataset and 1% for a large-scale dataset.
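The abstract names the two averaging schemes but does not give their formulas, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes the virtual word's embedding is the (optionally weighted) mean of the embeddings of the words already in the sentence, and that it is inserted at a uniformly random position. The function names `virtual_word_embedding` and `insert_virtual_word` are hypothetical, and the `weights` argument is a placeholder for per-word weights such as those derived from the class deviation factor, whose definition is not stated in this record.

```python
import random


def virtual_word_embedding(embeddings, weights=None):
    """Average word vectors into a single 'virtual word' vector.

    weights=None gives the unweighted average; otherwise a weighted
    average is computed. How the weights are derived (e.g. from the
    class deviation factor) is NOT specified here: callers supply them.
    """
    if not embeddings:
        raise ValueError("sentence must contain at least one word vector")
    if weights is None:
        weights = [1.0] * len(embeddings)
    if len(weights) != len(embeddings):
        raise ValueError("need exactly one weight per word vector")
    total = float(sum(weights))
    if total == 0.0:
        raise ValueError("weights must not sum to zero")
    dim = len(embeddings[0])
    return [
        sum(w * vec[i] for w, vec in zip(weights, embeddings)) / total
        for i in range(dim)
    ]


def insert_virtual_word(embeddings, weights=None, rng=None):
    """Return a new sentence with one virtual word at a random position.

    The original list is left unchanged; the label of the sentence is
    assumed to carry over to the augmented copy.
    """
    rng = rng or random.Random()
    virtual = virtual_word_embedding(embeddings, weights)
    position = rng.randint(0, len(embeddings))  # inclusive on both ends
    return embeddings[:position] + [virtual] + embeddings[position:]


# Toy 2-dimensional embeddings for a three-word sentence (made-up values).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
augmented = insert_virtual_word(sentence, rng=random.Random(0))
```

Calling `insert_virtual_word` several times with different random states would yield several augmented copies of the same sentence, which is the knob the abstract's ablation on the number of augmented sentences appears to vary.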
Pages: 17
Related papers
50 in total
  • [31] ALP: Data Augmentation Using Lexicalized PCFGs for Few-Shot Text Classification
    Kim, Hazel H.
    Woo, Daecheol
    Oh, Seong Joon
    Cha, Jeong-Won
    Han, Yo-Sub
    THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 10894 - 10902
  • [32] Exploring Word-gesture Text Entry Techniques in Virtual Reality
    Chen, Sibo
    Wang, Junce
    Guerra, Santiago
    Mittal, Neha
    Prakkamakul, Soravis
    CHI EA '19 EXTENDED ABSTRACTS: EXTENDED ABSTRACTS OF THE 2019 CHI CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS, 2019,
  • [33] Classification of application reviews into software maintenance tasks using data mining techniques
    Al-Hawari, Assem
    Najadat, Hassan
    Shatnawi, Raed
    SOFTWARE QUALITY JOURNAL, 2021, 29 (03) : 667 - 703
  • [35] LiDA: Language-Independent Data Augmentation for Text Classification
    Sujana, Yudianto
    Kao, Hung-Yu
    IEEE ACCESS, 2023, 11 : 10894 - 10901
  • [36] Hybrid Model of Data Augmentation Methods for Text Classification Task
    Feng, Jia Hui
    Mohaghegh, Mahsa
    PROCEEDINGS OF THE 13TH INTERNATIONAL JOINT CONFERENCE ON KNOWLEDGE DISCOVERY, KNOWLEDGE ENGINEERING AND KNOWLEDGE MANAGEMENT (KMIS), VOL 3, 2021, : 194 - 197
  • [37] TextANN: An Improved Text Classification Model Based on Data Augmentation
    Li, Hong
    Yang, Xiaosheng
    Yang, Guoqing
    Ouyang, Xiaogang
    Chen, Yu
    Wang, Xueqing
    2018 INTERNATIONAL CONFERENCE ON CLOUD COMPUTING, BIG DATA AND BLOCKCHAIN (ICCBB 2018), 2018, : 160 - 163
  • [38] Data augmentation and adversary attack on limit resources text classification
    Sánchez-Vega F.
    López-Monroy A.P.
    Balderas-Paredes A.
    Pellegrin L.
    Rosales-Pérez A.
    Multimedia Tools and Applications, 2025, 84 (3) : 1317 - 1344
  • [39] PDA: Data Augmentation with Preposition Words on Chinese text classification
    Yang, Leixin
    Xiong, Haoyu
    Xiang, Yu
    2024 2ND ASIA CONFERENCE ON COMPUTER VISION, IMAGE PROCESSING AND PATTERN RECOGNITION, CVIPPR 2024, 2024,
  • [40] A Submodular Optimization Framework for Imbalanced Text Classification With Data Augmentation
    Alemayehu, Eyor
    Fang, Yi
    IEEE ACCESS, 2023, 11 : 41680 - 41696