Data augmentation using virtual word insertion techniques in text classification tasks

Cited by: 1

Authors:

Long, Zhigao [1,2]

Li, Hong [1]

Shi, Jiawen [1]

Ma, Xin [1]

Affiliations:

[1] Cent South Univ, Sch Comp Sci & Engn, Changsha, Peoples R China

[2] Cent South Univ, Sch Comp Sci & Engn, Changsha 410083, Hunan, Peoples R China

Source:

EXPERT SYSTEMS | 2024 / Vol. 41 / Issue 04

Funding:

National Natural Science Foundation of China;

Keywords:

class deviation factor; data augmentation; text classification; virtual word insertion techniques;

DOI:

10.1111/exsy.13519

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Labelling large numbers of training examples for text classification models is usually time-consuming and complex. Data augmentation can automatically expand a dataset by transforming the original data. However, the transformation may change the meaning of a sentence while its label stays the same, which reduces the effectiveness of the classifier. In this paper, we propose a data-augmentation method called the virtual word insertion technique, which generates new sentences by randomly inserting virtual words into existing sentences. Two methods are used to obtain the virtual word embedding: an unweighted average and a weighted average. Furthermore, a new concept of weight is proposed: the class deviation factor, which measures the correlation between words and classes. Based on this concept, different weights are assigned to words of different classes. Experiments are conducted on five different classification tasks. Ablation experiments are also performed to explore the effects of the random operation and of the number of augmented sentences on the classification results. The results show that our method improves classification performance and outperforms two other data-augmentation methods used for comparison in automatically augmenting the dataset. Compared to the raw datasets, the average accuracy improvement of our method is 3.5% for a small-scale dataset and 1% for a large-scale dataset.
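
The abstract does not give formulas, so the following is only a minimal sketch of what virtual word insertion could look like, not the authors' implementation. It assumes that the virtual word embedding is an average over the word vectors of the sentence being augmented, and that any weights (for example, ones derived from a class deviation factor, whose definition is not stated in this record) are supplied by the caller. The function names `virtual_word_embedding` and `insert_virtual_word` are hypothetical.

```python
import random
from typing import List, Optional, Sequence

Vector = List[float]


def virtual_word_embedding(
    embeddings: Sequence[Vector],
    weights: Optional[Sequence[float]] = None,
) -> Vector:
    """Average a sentence's word vectors into one 'virtual word' vector.

    ASSUMPTION: averaging over the sentence's own words is an illustrative
    choice; the abstract only says "unweighted average and weighted average".
    weights=None gives the unweighted average; otherwise the weighted
    average is normalized by the sum of the weights.
    """
    if not embeddings:
        raise ValueError("sentence must contain at least one word vector")
    if weights is None:
        weights = [1.0] * len(embeddings)
    if len(weights) != len(embeddings):
        raise ValueError("need exactly one weight per word vector")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    dim = len(embeddings[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, embeddings)) / total
        for d in range(dim)
    ]


def insert_virtual_word(
    embeddings: Sequence[Vector],
    weights: Optional[Sequence[float]] = None,
    rng: Optional[random.Random] = None,
) -> List[Vector]:
    """Return a new embedded sentence with one virtual word inserted.

    The insertion position is drawn uniformly from all gaps, including
    before the first word and after the last word. The input is not modified.
    """
    rng = rng or random.Random()
    virtual = virtual_word_embedding(embeddings, weights)
    position = rng.randint(0, len(embeddings))  # randint is inclusive at both ends
    return list(embeddings[:position]) + [virtual] + list(embeddings[position:])


if __name__ == "__main__":
    # Toy 2-dimensional embeddings for a three-word sentence (made-up values).
    sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    # Hypothetical per-word weights standing in for class-dependent weights.
    word_weights = [1.0, 1.0, 2.0]
    augmented = insert_virtual_word(sentence, word_weights, random.Random(0))
    print(len(sentence), "->", len(augmented), "word vectors")
```

Calling `insert_virtual_word` several times on the same sentence with different random positions would yield several augmented copies that keep the original label, which is the quantity the ablation on the number of augmented sentences varies.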


Pages: 17
