Effectiveness of document representation for classification

Authors
Chen, DY [1]
Li, X [1]
Dong, ZY [1]
Chen, X [1]
Affiliations
[1] Univ Queensland, Sch Informat Technol & Elect Engn, Brisbane, Qld 4072, Australia
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Conventionally, research on document classification focuses on improving the learning capabilities of classifiers. Nevertheless, according to our observations, the effectiveness of classification is limited by the suitability of the document representation. Intuitively, the more features are used in a representation, the more comprehensively documents are represented. However, if a representation contains too many irrelevant features, the classifier suffers not only from the curse of dimensionality but also from overfitting. To address this problem of the suitability of document representations, we present a classifier-independent approach to measuring the effectiveness of document representations. Our approach utilises a labelled document corpus to estimate the distribution of documents in the feature space. By examining documents in this way, we can clearly identify the contributions made by different features to document classification. Experiments have been performed to show how the effectiveness is evaluated. Our approach can be used as a tool to assist feature selection, dimensionality reduction and document classification.
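The abstract does not state which measure the authors use, so the following is only an illustrative sketch of the general idea of scoring features from a labelled corpus without training a classifier. It assumes documents are already given as numeric feature vectors (for example term counts) and uses a Fisher-style ratio of between-class to within-class variance per feature. The function name feature_separation_scores and the toy data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def feature_separation_scores(docs, labels):
    """Per-feature ratio of between-class to within-class variance.

    Illustrative stand-in only: this Fisher-style score is an assumption,
    not the measure defined in the paper.
    """
    n_features = len(docs[0])

    # Group the document vectors by their class label.
    by_class = defaultdict(list)
    for vec, label in zip(docs, labels):
        by_class[label].append(vec)

    scores = []
    for j in range(n_features):
        values = [vec[j] for vec in docs]
        overall_mean = sum(values) / len(values)
        between = 0.0  # spread of class means around the overall mean
        within = 0.0   # spread of documents around their own class mean
        for vecs in by_class.values():
            col = [v[j] for v in vecs]
            class_mean = sum(col) / len(col)
            between += len(col) * (class_mean - overall_mean) ** 2
            within += sum((x - class_mean) ** 2 for x in col)
        if within > 0:
            scores.append(between / within)
        else:
            # No within-class spread: perfectly separating or constant feature.
            scores.append(float("inf") if between > 0 else 0.0)
    return scores


# Hypothetical toy corpus: two features (term counts), two classes.
docs = [[3, 1], [4, 0], [0, 1], [1, 0]]
labels = ["sport", "sport", "politics", "politics"]
scores = feature_separation_scores(docs, labels)
print(scores)
```

In this toy corpus the first feature differs strongly between the two classes while the second has the same mean in both, so the first receives the higher score. A ranking of this kind is one way such a measure could assist feature selection or dimensionality reduction.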
Pages: 368-377
Page count: 10