Variance-based features for keyword extraction in Persian and English text documents

Cited by: 0
Authors
Veisi H. [1]
Aflaki N. [2, 3]
Parsafard P. [2 ]
Affiliations
[1] Faculty of New Sciences and Technologies (FNST), University of Tehran, Tehran
[2] Kish International Campus, University of Tehran, Kish
[3] Geoinformatics Collaboratory, School of Natural and Computational Sciences, Massey University, Auckland
Keywords
Clustering; Extraction; Persian text processing; Term frequency; Variance
DOI
10.24200/SCI.2019.50426.1685
Abstract
This paper addresses automatic keyword extraction in Persian and English text documents. Generally, to extract keywords from a text, a weight is assigned to each token, and the words with the highest weights are selected as keywords. This study proposes four word-weighting methods and compares them with five existing weighting techniques: Term Frequency (TF), Term Frequency-Inverse Document Frequency (TF-IDF), variance, Discriminative Feature Selection (DFS), and document length normalization based on unit words (LNU). The proposed methods are built on variance features: the variance-to-TF-IDF ratio, the variance-to-TF ratio, the intersection of TF and variance, and the intersection of variance and IDF. For evaluation, the documents are clustered with K-means, Expectation Maximization (EM), and Ward hierarchical clustering, using the extracted keywords as feature vectors. The entropy of the clusters relative to the predefined document classes serves as the evaluation metric. For the evaluations, this study collected and labeled Persian documents. The results show that the proposed variance-to-TF ratio performed best for Persian texts, while the variance-to-TF-IDF ratio yielded the best entropy for English texts. © 2020 Sharif University of Technology. All rights reserved.
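The weighting and evaluation steps described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything here is an assumption: TF is taken as the relative frequency of a word within a document, variance as the population variance of that frequency across documents, the variance-to-TF ratio as variance divided by the mean TF, and the evaluation metric as the size-weighted class entropy of each cluster. The function names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def term_frequencies(docs):
    """Return the vocabulary and, per document, relative term frequencies."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    tf = [{w: c[w] / sum(c.values()) for w in vocab} for c in counts]
    return vocab, tf


def variance_to_tf_weights(docs):
    """Assumed weighting: variance of a word's TF across documents / mean TF."""
    vocab, tf = term_frequencies(docs)
    n = len(docs)
    weights = {}
    for w in vocab:
        values = [d[w] for d in tf]
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / n
        # mean > 0 because every vocabulary word occurs in some document
        weights[w] = var / mean
    return weights


def top_keywords(weights, k):
    """Pick the k highest-weighted words (ties broken alphabetically)."""
    return sorted(weights, key=lambda w: (-weights[w], w))[:k]


def cluster_entropy(cluster_ids, class_labels):
    """Size-weighted average entropy of the class labels inside each cluster."""
    n = len(cluster_ids)
    total = 0.0
    for c in set(cluster_ids):
        members = [lab for cid, lab in zip(cluster_ids, class_labels) if cid == c]
        probs = [m / len(members) for m in Counter(members).values()]
        total += len(members) / n * -sum(p * math.log2(p) for p in probs)
    return total


if __name__ == "__main__":
    docs = ["a b", "a c"]  # toy corpus, not the paper's data
    weights = variance_to_tf_weights(docs)
    print(top_keywords(weights, 2))
    print(cluster_entropy([0, 0, 1, 1], ["x", "y", "x", "y"]))
```

In this toy corpus the word "a" appears with the same frequency in both documents, so its variance and weight are zero, whereas "b" and "c" each occur in only one document and receive the higher weight. Lower cluster entropy indicates clusters that align more closely with the predefined classes.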
Pages: 1301-1315
Page count: 14