On a New Model for Automatic Text Categorization Based on Vector Space Model

Authors
Suzuki, Makoto [1]
Yamagishi, Naohide [1]
Ishida, Takashi [2]
Goto, Masayuki [2]
Hirasawa, Shigeichi [3]
Affiliations
[1] Shonan Inst Technol, Fac Informat Sci, 1-1-25 Tsujido Nishikaigan, Kanagawa 2518511, Japan
[2] Waseda Univ, Shinjuku Ku, Tokyo 169, Japan
[3] Cyber Univ, Shinjuku Ku, Tokyo 162, Japan
Keywords
text mining; classification; N-gram; newspaper
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In our previous paper, we proposed a new classification technique called the Frequency Ratio Accumulation Method (FRAM). This simple technique adds up the ratios of term frequencies among categories, and it can use index terms without limit. We then improved FRAM by adopting character N-grams to form the index terms. However, FRAM lacked a satisfactory mathematical basis. We therefore present a new mathematical model based on the Vector Space Model and consider its implications. The proposed method is evaluated in several experiments in which we classify newspaper articles from the English Reuters-21578 data set and the Japanese CD-Mainichi 2002 data set. Reuters-21578 is a benchmark data set for automatic text categorization. The results show that FRAM has good classification accuracy: the micro-averaged F-measure of the proposed method is 92.2% for English. The proposed method performs classification with a single program and is language-independent.
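The abstract describes FRAM only in outline, so the following Python fragment is a minimal sketch of one possible reading, not the authors' implementation. It assumes that the "ratio" of a term for a category is that term's raw count in the category divided by its total count over all categories, and that a document's score for a category is the sum of these ratios over the document's character N-grams. The function names `char_ngrams`, `train`, and `classify`, the choice of N = 3, and the toy training texts are all hypothetical.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(labeled_docs, n=3):
    """Build per-category term ratios from (text, category) pairs.

    Assumption: ratio = count of the term in the category divided by
    the count of the term summed over all categories.
    """
    freq = defaultdict(Counter)  # category -> term -> count
    for text, category in labeled_docs:
        freq[category].update(char_ngrams(text, n))

    totals = Counter()  # term -> count over all categories
    for counts in freq.values():
        totals.update(counts)

    return {
        category: {term: count / totals[term] for term, count in counts.items()}
        for category, counts in freq.items()
    }


def classify(text, ratios, n=3):
    """Accumulate the ratios of the text's n-grams and pick the best category."""
    terms = char_ngrams(text, n)
    scores = {
        category: sum(term_ratios.get(term, 0.0) for term in terms)
        for category, term_ratios in ratios.items()
    }
    return max(scores, key=scores.get), scores


# Toy usage with made-up training texts (not taken from any data set).
training = [
    ("wheat corn harvest", "grain"),
    ("oil crude barrel", "energy"),
]
model = train(training)
label, scores = classify("corn harvest", model)
print(label, scores)
```

Because the index terms are plain character N-grams, the same sketch runs unchanged on text in any language, which is consistent with the language-independence claimed in the abstract. No term selection or vocabulary cap is applied, in keeping with the statement that index terms can be used without limit.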
Pages: 3152-3159
Number of pages: 8