Short Texts Classification Through Reference Document Expansion

Authors
Yang Zhen [1]
Fan Kefeng [2]
Lai Yingxu [1]
Gao Kaiming [1]
Wang Yong [3]
Affiliations
[1] Beijing Univ Technol, Coll Comp Sci, Beijing 100124, Peoples R China
[2] China Elect Standardizat Inst, Beijing 100007, Peoples R China
[3] Guilin Univ Elect Technol, CSIP Guangxi Sect, Guilin 541004, Peoples R China
Funding
National High Technology Research and Development Program of China (863 Program); Beijing Natural Science Foundation
Keywords
Text classification; Short texts; Language model; Document expansion; External reference
DOI
Not available
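Chinese Library Classification number
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline classification code
0808; 0809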
Abstract
With the rapid development of information technology, short texts arising from social interaction are becoming predominant in network information streams, and demand is growing for more effective classification of such texts. However, traditional document classification models run into difficulty with short text documents, each of which contains only a few words. Aggressive document expansion works remarkably well in many cases but suffers from the assumption of independent, identically distributed observations. We formalize a view of classification using Bayesian decision theory, treat each short text as observations from a probabilistic model, called a statistical language model, and encode classification preferences with a loss function defined by the language models and the external reference document. According to Vapnik's method of structural risk minimization (SRM), the optimal classification action is the one that minimizes the structural risk, which allows one to trade off errors on the training sample against improved generalization performance. We conduct experiments on several corpora of microblog-like data and analyze the results. Compared with established baselines, the experiments show that applying the proposed document expansion method is more likely to yield improved classification performance.
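The general idea in the abstract (a short text modeled by a statistical language model, expanded with an external reference document, and assigned to the class with the lowest expected loss) can be illustrated with a minimal sketch. This is not the authors' formulation: the interpolation weight `lam`, the additive smoothing constant `alpha`, the cross-entropy loss, the function names, and the toy data are all assumptions chosen for illustration, and the structural risk minimization step is not modeled here.

```python
import math
from collections import Counter


def smoothed_unigram(tokens, vocab, alpha=0.1):
    """Additively smoothed unigram language model over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def expanded_model(short_tokens, reference_tokens, vocab, lam=0.5):
    """Expand a short text by interpolating its model with a reference model.

    lam is a hypothetical mixing weight: 1.0 ignores the reference document,
    0.0 relies on it entirely.
    """
    short_lm = smoothed_unigram(short_tokens, vocab)
    ref_lm = smoothed_unigram(reference_tokens, vocab)
    return {w: lam * short_lm[w] + (1 - lam) * ref_lm[w] for w in vocab}


def cross_entropy(p, q):
    """Cross-entropy of q relative to p, used here as an assumed loss."""
    return -sum(p[w] * math.log(q[w]) for w in p)


def classify(short_tokens, reference_tokens, class_docs, lam=0.5):
    """Return the label whose class model gives the lowest loss, plus all losses."""
    vocab = set(short_tokens) | set(reference_tokens)
    for tokens in class_docs.values():
        vocab |= set(tokens)
    text_lm = expanded_model(short_tokens, reference_tokens, vocab, lam)
    risks = {
        label: cross_entropy(text_lm, smoothed_unigram(tokens, vocab))
        for label, tokens in class_docs.items()
    }
    return min(risks, key=risks.get), risks


# Toy data, invented for illustration only.
class_docs = {
    "sports": "match goal team score win league player".split(),
    "tech": "phone software update app device chip release".split(),
}
short_text = "new update".split()
reference = "the software update adds app features for the phone".split()

label, risks = classify(short_text, reference, class_docs)
print(label, risks)
```

In this toy setup the short text shares only one word with the training data, while the reference document contributes several more class-related words, which is the intuition behind expansion. How the reference document is selected and how the loss is actually defined are left to the paper itself.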
Pages: 315-321
Page count: 7