Effectiveness of document representation for classification

Cited by: 0
Authors
Chen, DY [1]
Li, X [1]
Dong, ZY [1]
Chen, X [1]
Affiliation
[1] Univ Queensland, Sch Informat Technol & Elect Engn, Brisbane, Qld 4072, Australia
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Conventionally, research on document classification has focused on improving the learning capabilities of classifiers. According to our observations, however, the effectiveness of classification is limited by the suitability of the document representation. Intuitively, the more features a representation uses, the more comprehensively it represents the documents. If a representation contains too many irrelevant features, though, the classifier suffers not only from the curse of dimensionality but also from overfitting. To address this problem of the suitability of document representations, we present a classifier-independent approach to measuring their effectiveness. Our approach uses a labelled document corpus to estimate the distribution of documents in the feature space. Examining documents in this way lets us identify the contribution that each feature makes to document classification. Experiments show how the effectiveness is evaluated. Our approach can be used as a tool to assist feature selection, dimensionality reduction and document classification.
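The abstract does not say how the distribution of documents in the feature space is estimated, so the following is only an illustrative sketch of the general idea of scoring features from a labelled corpus without training a classifier. It is not the authors' method. The function name `feature_separation_scores`, the variance-ratio score, and the toy corpus are all assumptions introduced here.

```python
from collections import defaultdict
from statistics import mean, pvariance


def feature_separation_scores(docs, labels):
    """Score each term by between-class variance over within-class variance.

    Hypothetical helper, not taken from the paper: `docs` is a list of
    term -> weight dicts and `labels` holds the class of each document.
    """
    # Group the documents by their class label.
    by_class = defaultdict(list)
    for doc, label in zip(docs, labels):
        by_class[label].append(doc)

    vocab = sorted({term for doc in docs for term in doc})
    scores = {}
    for term in vocab:
        overall = mean(doc.get(term, 0.0) for doc in docs)
        between = 0.0
        within = 0.0
        for members in by_class.values():
            values = [doc.get(term, 0.0) for doc in members]
            # Spread of the class mean around the corpus mean.
            between += len(values) * (mean(values) - overall) ** 2
            # Spread of the documents inside the class.
            within += len(values) * pvariance(values)
        # The small constant avoids division by zero for constant features.
        scores[term] = between / (within + 1e-9)
    return scores


# Toy labelled corpus (made up for illustration only).
docs = [
    {"goal": 3, "the": 5},
    {"goal": 2, "the": 4},
    {"vote": 3, "the": 5},
    {"vote": 2, "the": 4},
]
labels = ["sport", "sport", "politics", "politics"]
scores = feature_separation_scores(docs, labels)
```

In this toy corpus, "the" is distributed identically in both classes and scores zero, while "goal" and "vote" each occur in only one class and score higher. A ranking of this kind could support feature selection before any classifier is chosen.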
Pages: 368-377
Page count: 10