Machine learning in automated text categorization

Cited by: 4410
Author
Sebastiani, F. [1]
Affiliation
[1] CNR, Ist Elaboraz Informaz, I-56124 Pisa, Italy
Keywords
algorithms; experimentation; theory; machine learning; text categorization; text classification
DOI
10.1145/505282.505283
Chinese Library Classification number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
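The three problems named in the abstract (document representation, classifier construction, classifier evaluation) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the survey: the bag-of-words representation, the choice of a multinomial Naive Bayes learner with add-one smoothing, the toy `TRAIN` and `TEST` documents, and all function names.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Document representation: a lowercase bag-of-words term count."""
    return Counter(text.lower().split())

def train(labeled_docs):
    """Classifier construction: collect per-category term statistics
    from preclassified (text, label) pairs."""
    class_docs = Counter()             # documents seen per category
    term_counts = defaultdict(Counter) # term frequencies per category
    vocab = set()
    for text, label in labeled_docs:
        bag = tokenize(text)
        class_docs[label] += 1
        term_counts[label].update(bag)
        vocab.update(bag)
    return class_docs, term_counts, vocab

def classify(model, text):
    """Return the category with the highest smoothed log-probability."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    bag = tokenize(text)
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)
        denom = sum(term_counts[label].values()) + len(vocab)
        for term, n in bag.items():
            if term in vocab:  # terms unseen in training are ignored
                score += n * math.log((term_counts[label][term] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def precision_recall(model, test_docs, target):
    """Classifier evaluation: precision and recall for one category."""
    tp = fp = fn = 0
    for text, label in test_docs:
        pred = classify(model, text)
        if pred == target and label == target:
            tp += 1
        elif pred == target:
            fp += 1
        elif label == target:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical toy data, invented for this sketch.
TRAIN = [
    ("wheat corn harvest prices", "grain"),
    ("corn exports rise", "grain"),
    ("bank raises interest rates", "finance"),
    ("interest rates and bank loans", "finance"),
]
TEST = [("corn harvest", "grain"), ("bank interest", "finance")]

model = train(TRAIN)
```

Calling `classify(model, "corn harvest")` assigns the document to one of the two toy categories, and `precision_recall(model, TEST, "grain")` scores the classifier on held-out documents. A real system would replace each stage with the more careful techniques the survey covers.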
Pages: 1-47
Number of pages: 47