Sampling and feature selection in a genetic algorithm for document clustering

Cited by: 0
Authors
Casillas, A [1]
de Lena, MTG
Martínez, R
Affiliations
[1] Univ Basque Country, Dpt Elect & Elect, E-48080 Bilbao, Spain
[2] Univ Rey Juan Carlos, Dpt Informat Estadist & Telemat, Madrid, Spain
DOI
Not available
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In this paper we describe a genetic algorithm (GA) for document clustering that includes a sampling technique to reduce computation time. The algorithm computes an approximation of the optimum value of k and finds the best grouping of the documents into these k clusters. We evaluate it on sets of documents returned by a search engine in response to a query. Two types of experiments are carried out to determine (1) how the genetic algorithm performs on a sample of documents and (2) which document features lead to the best clustering according to an external evaluation. On the one hand, our GA with sampling performs the clustering in a time that makes interaction with a search engine viable. On the other hand, our GA approach with documents represented by entities leads to better results than representation by lemmas only.
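The abstract does not specify the chromosome encoding, the fitness function, or the genetic operators, so the following is only a minimal sketch of the general idea (evolve a clustering on a random sample, then extend it to all documents), not the authors' method. Every name and parameter here (`ga_cluster`, `fitness`, `penalty`, `k_max`, the label-per-document encoding, the cosine-cohesion fitness) is an assumption made for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def group(labels, vectors):
    """Map each cluster label to the list of vectors carrying that label."""
    clusters = {}
    for lab, vec in zip(labels, vectors):
        clusters.setdefault(lab, []).append(vec)
    return clusters


def fitness(labels, vectors, penalty=0.05):
    """Assumed fitness: mean similarity to own centroid, minus a cost per cluster used."""
    cents = {lab: centroid(vs) for lab, vs in group(labels, vectors).items()}
    cohesion = sum(cosine(v, cents[l]) for l, v in zip(labels, vectors)) / len(vectors)
    return cohesion - penalty * len(cents)


def ga_cluster(vectors, sample_size, k_max=4, pop_size=30, generations=60, seed=0):
    """Evolve cluster labels on a random sample, then label every document."""
    rng = random.Random(seed)
    idx = rng.sample(range(len(vectors)), min(sample_size, len(vectors)))
    sample = [vectors[i] for i in idx]
    n = len(sample)  # the sketch assumes n >= 2 and pop_size >= 4
    # Each individual assigns one label in [0, k_max) to each sampled document.
    pop = [[rng.randrange(k_max) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda ind: fitness(ind, sample), reverse=True)
        elite = pop[: pop_size // 2]  # keep the better half unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(n)] = rng.randrange(k_max)  # point mutation
            children.append(child)
        pop = elite + children
    best = max(pop, key=lambda ind: fitness(ind, sample))
    # The number of distinct labels in the best individual is the estimate of k.
    cents = [centroid(vs) for vs in group(best, sample).values()]
    # Extend the sampled clustering to all documents by nearest centroid.
    labels = [max(range(len(cents)), key=lambda c: cosine(v, cents[c])) for v in vectors]
    return len(cents), labels


# Toy term-weight vectors: two groups of documents with disjoint vocabulary.
docs = [[1, 0, 0], [2, 0, 0], [1, 1, 0], [3, 1, 0],
        [0, 0, 1], [0, 0, 2], [0, 0, 3], [0, 0, 1]]
k, labels = ga_cluster(docs, sample_size=6)
print(k, labels)
```

In this sketch the per-cluster penalty is what lets the search settle on a value of k instead of always preferring more clusters, and running the evolution on `sample_size` documents rather than the whole set is where the time saving would come from. The document vectors could be built from lemmas or from entities, which is the feature choice the second experiment compares.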
Pages: 601-612
Number of pages: 12