A novel robust kernel for classifying high-dimensional data using Support Vector Machines

Cited by: 41

Authors:

Hussain, Syed Fawad [1]

Affiliations:

[1] Ghulam Ishaq Khan Inst Engn Sci & Technol, Machine Learning & Data Sci MDS Lab, Fac Comp Sci & Engn, Topi, Pakistan

Source:

EXPERT SYSTEMS WITH APPLICATIONS | 2019 / Volume 131

Keywords:

Semantic kernels; Support Vector Machines; Co-clustering; Label noise; TEXT CLASSIFICATION; CLASSIFIERS; ALGORITHM;

DOI:

10.1016/j.eswa.2019.04.037

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This paper presents a new semantic kernel for classification of high-dimensional data in the framework of Support Vector Machines (SVM). SVMs have gained widespread application due to their relatively high accuracy. The efficacy of SVMs, however, depends upon the separation of the data itself as well as the kernel function. Text data, for instance, is difficult to classify due to synonymy and polysemy in its contents, having multi-topical instances that can result in mislabeling, and being highly sparse in the bag-of-words representation. While the soft margin parameter and kernel tricks are used in SVM to deal with outliers and non-linearly separable data, using data statistics and correlation has not been fully explored in the literature. This paper explores the use of co-similarity (i.e., soft co-clustering) to find latent relationships between documents, motivated by the success of co-clustering and subspace clustering methods. It has been shown that the use of weighted higher-order paths between instances in the data can be a good measure of similarity, which can then be used both for classification and to correct mislabeled (or outlier) data in the training set. The proposed kernel is generic in nature and suitable for sparse, dyadic data where direct co-occurrences are not necessarily common, as in the case of textual data, link analysis in social media networks, co-authorship, etc. The paper also studies the impact of noise in the training data and provides a technique to re-label such instances. It is also observed that re-labelling of selected training data reduces the adverse effect of outliers or label noise and can greatly improve the classification of the test data. To the best of our knowledge, we are the first to introduce a supervised co-similarity based kernel function and also provide a mathematical formulation to show that it is a valid Mercer kernel. Our experiments show that the proposed framework outperforms current and state-of-the-art methods in terms of classification accuracy and is more resilient to label noise. (C) 2019 Elsevier Ltd. All rights reserved.
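
Illustrative note (not part of the published record): the abstract's idea of weighted higher-order paths can be pictured with a small toy sketch. The code below is not the paper's kernel; it assumes a simple stand-in, the sum of `decay**t * (A A^T)**t` over path orders `t`, where `A` is a document-term matrix. The function names, the `decay` and `max_order` parameters, and the toy data are all hypothetical. A non-negatively weighted sum of powers of a Gram matrix is symmetric positive semidefinite, so this stand-in is at least a valid kernel matrix.

```python
def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def higher_order_kernel(docs, decay=0.5, max_order=3):
    """Return sum over t = 1..max_order of decay**t * (A A^T)**t.

    Illustrative stand-in only; the paper's co-similarity kernel is
    defined differently and is not reproduced here.
    """
    transposed = [list(col) for col in zip(*docs)]
    gram = matmul(docs, transposed)  # A A^T: direct term co-occurrence
    n = len(gram)
    kernel = [[0.0] * n for _ in range(n)]
    # Start from the identity so the first product yields gram itself.
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for order in range(1, max_order + 1):
        power = matmul(power, gram)  # (A A^T)**order: paths of this order
        weight = decay ** order      # longer paths contribute less
        for i in range(n):
            for j in range(n):
                kernel[i][j] += weight * power[i][j]
    return kernel


# Toy bag-of-words rows: documents 0 and 2 share no term,
# but each shares one term with document 1.
docs = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
K = higher_order_kernel(docs)
print(K[0][2])  # non-zero although documents 0 and 2 have no term in common
```

Documents 0 and 2 have a zero dot product, yet they receive a positive similarity through the path via document 1, which is the effect the abstract attributes to higher-order paths on sparse data. A matrix of this kind could be handed to an SVM implementation that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`.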


Pages: 116 - 131

Number of pages: 16

