Semi-supervised patent text classification method based on improved Tri-training algorithm

Cited by: 3
Authors
Hu Y.-Q. [1]
Qiu Q.-Y. [1]
Yu X. [1]
Wu J.-W. [1]
Affiliation
[1] College of Mechanical Engineering, Zhejiang University, Hangzhou
Keywords
Feature selection; Information gain; Patent text classification; Semi-supervised; Tri-training algorithm
DOI
10.3785/j.issn.1008-973X.2020.02.014
Abstract
An improved information gain (IG) algorithm is proposed to address the limitation that standard IG measures a feature's contribution to the classification system as a whole but not to an individual category. A weight coefficient is introduced to adjust the IG values of features that are important for classification, so that the uneven distribution of a word across categories is better taken into account. A semi-supervised classification method based on an improved Tri-training algorithm is also proposed, targeting the bottleneck of labeling training sets in traditional automatic patent classification. The prediction-probability thresholds that the three classifiers apply when assigning a category to the same unlabeled sample are adjusted dynamically by tracking the category distribution of the training sets after each iteration. This reduces the influence of noisy data and makes fuller use of the unlabeled training samples. Results indicate that the proposed method classifies effectively when few labeled training samples are available, and that the generalization ability of the classifier can be improved by appropriately increasing the amount of unlabeled sample data. © 2020, Zhejiang University Press. All rights reserved.
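
The limitation of standard IG described in the abstract can be illustrated with a minimal Python sketch. It computes only the textbook information gain of a term over a toy corpus; it is not the paper's improved variant, because the abstract does not give the weight coefficient's formula. The function names, the toy documents, and the category labels are all hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of category labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(docs, labels, term):
    """Standard IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t).

    docs is a list of sets of terms; labels holds one category per document.
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (
        (len(with_term) / n) * entropy(with_term)
        + (len(without_term) / n) * entropy(without_term)
    )
    return entropy(labels) - conditional


# Hypothetical toy corpus: two "mech" and two "elec" patent snippets.
docs = [
    {"gear", "shaft"},
    {"gear", "motor"},
    {"circuit", "motor"},
    {"circuit", "diode"},
]
labels = ["mech", "mech", "elec", "elec"]

for term in ("gear", "motor"):
    print(term, information_gain(docs, labels, term))
```

The score is a single number per term for the whole label set: it shows that "gear" separates the categories and "motor" does not, but it does not say which category "gear" is characteristic of. That per-category information is what the weight coefficient in the abstract is meant to bring in.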
Pages: 331-339
Page count: 8