TCS: A Teacher-Curriculum-Student Learning Framework for Cross-Lingual Text Labeling

Cited by: 0
Authors
Pu T. [1 ]
Huang S.-J. [1 ,2 ]
Zhang Y.-M. [3 ]
Zhou X.-S. [3 ]
Tu Y.-F. [3 ]
Dai X.-Y. [1 ]
Chen J.-J. [1 ]
Affiliations
[1] National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing
[2] Peng Cheng Laboratory, Shenzhen
[3] ZTE Corporation, Nanjing
Funding
National Natural Science Foundation of China;
Keywords
Cross-lingual transfer; Curriculum learning; Named entity recognition; Self-training; Text classification;
DOI
10.11897/SP.J.1016.2022.01983
Abstract
In recent years, deep learning models have greatly advanced natural language processing for English and Chinese. For most other languages in the world, however, effective text processing and analysis remains out of reach because labeled data are difficult to obtain. Cross-lingual transfer, which uses labeled samples in a source language to teach a model the corresponding task in a target language, is the main way to address this problem and has therefore attracted wide attention. Recently, several self-training approaches have achieved the best results on cross-lingual text labeling tasks by fine-tuning multilingual BERT with both labeled source-language samples and unlabeled target-language samples. However, self-training suffers from inaccurate supervision: the teacher model's inaccurate predictions on the unlabeled target samples (i.e., inaccurate samples) can mislead the subsequent student model, and in cross-lingual transfer scenarios the natural distribution gap between the labeled source samples and the unlabeled target samples makes this problem even worse. To further improve cross-lingual text labeling, this paper applies three techniques to address the inaccurate-supervision problem in self-training and proposes a learning framework called Teacher-Curriculum-Student (TCS). First, a soft-target training technique reduces the impact of inaccurate samples at the level of the loss function. Second, a progressive sample selection technique constructs iterative training datasets that contain more accurate samples.
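The paper's actual implementation is in the linked repository; as a rough illustration only (function names and array shapes are assumptions, not the authors' code), soft-target training replaces the teacher's hard pseudo-label with its full output distribution, so an uncertain teacher prediction sends a correspondingly weaker training signal:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_target_loss(student_logits, teacher_probs):
    # Cross-entropy against the teacher's full distribution ("soft targets")
    # rather than its argmax pseudo-label.
    log_p = np.log(softmax(student_logits) + 1e-12)
    return float(-(teacher_probs * log_p).sum(axis=-1).mean())

# For the same student output, a confident teacher that disagrees with the
# student penalizes it more than an uncertain teacher does.
student = np.array([[2.0, 0.0, 0.0]])       # student favors class 0
confident = np.array([[0.05, 0.90, 0.05]])  # teacher sure of class 1
uncertain = np.array([[0.30, 0.40, 0.30]])  # teacher unsure
assert soft_target_loss(student, confident) > soft_target_loss(student, uncertain)
```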
Finally, to handle the inaccurate samples that remain in the iterative training datasets, we propose a from-confident-to-suspicious curriculum learning technique: according to the teacher model's prediction confidences, a training dataset is organized into learning courses ordered from confident to suspicious, which strengthens the role of accurate samples and weakens that of inaccurate samples during the student model's training. Experiments on benchmark datasets for cross-lingual text classification and cross-lingual named entity recognition show that, thanks to these three techniques, TCS improves the average results of self-training by 2.51% and 3.25% respectively, and exceeds the previous state-of-the-art results by 1.51% and 4.45% respectively. Ablation experiments further show that all three techniques effectively improve the final model, with curriculum learning contributing the largest gain, and that the from-confident-to-suspicious order is the key to the effectiveness of curriculum learning in the self-training scenario. More interestingly, further analysis shows that TCS outperforms self-training throughout the iterative training process: the sample selection technique takes effect in the initial iterations, while the effect of curriculum learning appears mainly in the middle and later iterations. For reproducibility, we release the code and experimental configurations at https://github.com/ericput/TCS. © 2022, Science Press. All rights reserved.
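As a minimal sketch of the remaining two techniques (the function and data names below are hypothetical, not taken from the repository), both progressive sample selection and the from-confident-to-suspicious ordering can be driven by the same quantity, the teacher's maximum predicted probability per sample:

```python
def build_curriculum(pseudo_labeled, keep_fraction):
    # pseudo_labeled: list of (example, teacher_probs) pairs, where
    # teacher_probs is the teacher's predicted class distribution.
    # Rank samples by teacher confidence (max predicted probability).
    ranked = sorted(pseudo_labeled, key=lambda s: max(s[1]), reverse=True)
    # Progressive sample selection: keep only the most confident fraction;
    # the fraction can be grown across self-training iterations.
    kept = ranked[:max(1, int(len(ranked) * keep_fraction))]
    # From-confident-to-suspicious curriculum: the kept samples are already
    # in descending-confidence order, so the student sees confident
    # "courses" first and suspicious ones last.
    return kept

data = [("a", [0.5, 0.5]), ("b", [0.9, 0.1]), ("c", [0.2, 0.8])]
course = build_curriculum(data, keep_fraction=2 / 3)
# course → [("b", [0.9, 0.1]), ("c", [0.2, 0.8])]
```

The key design choice the abstract highlights is the ordering, not just the filtering: even among the kept samples, the suspicious ones are deferred so early student updates rest on the most reliable supervision.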
Pages: 1983-1996
Page count: 13