Automatic Chinese Text Classification Using Character-based and Word-based Approach

Cited by: 4
Authors
Luo, Xi [1]
Ohyama, Wataru [1]
Wakabayashi, Tetsushi [1]
Kimura, Fumitaka [1]
Affiliations
[1] Mie Univ, Grad Sch Engn, Tsu, Mie 514, Japan
Source
2013 12TH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION (ICDAR) | 2013
Keywords
Chinese Text Classification/Categorization; N-gram; Feature Transformation; Dimension Reduction; Principal Component Analysis; Support Vector Machine;
DOI
10.1109/ICDAR.2013.73
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we study Chinese text classification using a character-based approach (N-gram) and a word-based approach, and propose the combined use of uni-gram, bi-gram, and word features of length greater than or equal to three. We also introduce a weight coefficient that can be used to give higher weights to word features. We further investigate a serial approach based on feature transformation and dimension reduction techniques to improve performance. Experimental results show that the proposed approach is efficient and effective for improving the performance of Chinese text classification.
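The feature set described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `extract_features`, the parameter `word_weight`, and the choice to apply the weight coefficient by multiplying word counts are all assumptions made for illustration, and the word list is assumed to come from an external segmenter. The feature transformation, dimension reduction, and classifier stages are not shown.

```python
from collections import Counter


def extract_features(text, words, word_weight=2.0):
    """Build a sparse feature vector from one document (illustrative only).

    text        -- raw document string
    words       -- words produced by some external segmenter (assumed given)
    word_weight -- hypothetical weight coefficient applied to word features
    """
    # Drop whitespace, so bi-grams may span characters that a space separated.
    chars = [c for c in text if not c.isspace()]
    feats = Counter()

    # Character uni-grams.
    for c in chars:
        feats[("uni", c)] += 1.0

    # Character bi-grams (adjacent character pairs).
    for a, b in zip(chars, chars[1:]):
        feats[("bi", a + b)] += 1.0

    # Word features: only words of length >= 3, scaled by the weight coefficient.
    for w in words:
        if len(w) >= 3:
            feats[("word", w)] += word_weight

    return feats


feats = extract_features("计算机科学", ["计算机", "科学"])
```

In this example the two-character word "科学" contributes no word feature, since it is already covered by a bi-gram, while "计算机" is added with the higher weight.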
Pages: 329-333
Number of pages: 5