Uyghur-Kazakh-Kirghiz Text Keyword Extraction Based on Morpheme Segmentation

Cited by: 1
Authors
Parhat, Sardar [1]
Sattar, Mutallip [1]
Hamdulla, Askar [2]
Kadir, Abdurahman [1]
Affiliations
[1] Xinjiang Univ Finance & Econ, Coll Informat Management, Urumqi 830012, Peoples R China
[2] Xinjiang Univ, Coll Informat Sci & Engn, Urumqi 830046, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Uyghur-Kazakh-Kirghiz; keyword extraction; morpheme segmentation; stem extraction; stem vector; TextRank;
DOI
10.3390/info14050283
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
This study investigates text keyword extraction for Uyghur, Kazakh and Kirghiz within a morpheme segmentation framework. The three languages have similar grammatical and lexical structures: a word is formed by joining affixes to a stem. The stem carries the notional meaning, while the affixes perform grammatical functions. Because of this derivational morphology, the vocabularies of these languages are very large, so pre-processing is a necessary step in NLP tasks for Uyghur, Kazakh and Kirghiz. Morpheme segmentation lets us remove the suffixes as auxiliary units while retaining the meaningful stems, which reduces the dimension of the feature space in the keyword extraction task. We cast morpheme segmentation as a sequence labeling problem: a Bi-LSTM network captures the positional features of the character sequence in both directions, and a CRF layer learns the dependencies between preceding and following labels. The resulting Bi-LSTM_CRF model segments morphemes with high accuracy, and we used it to prepare morpheme-based experimental text sets. We then trained stem embedding vectors with the Doc2vec algorithm, used the similarity between stem vectors to modify the TextRank algorithm, and performed text keyword extraction experiments. The highest F1 scores obtained on the three datasets were 43.8%, 44% and 43.9%. The results show that the morpheme-based approach performs much better than the word-based approach, which indicates that stem vector similarity weighting is an effective method for text keyword extraction and that morpheme sequences are an efficient representation for morphologically derivative languages.
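The abstract does not give the exact weighting formula, so the following is only a minimal sketch of how TextRank could be weighted by stem vector similarity. It assumes that stems co-occurring within a small window are linked and that each edge weight is the cosine similarity of the two stem vectors (negative values clipped to zero). The function names, the window size, the damping factor and the toy vectors are all hypothetical; in the paper the vectors come from Doc2vec training, which is replaced here by hand-written values.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def weighted_textrank(stems, vectors, window=2, damping=0.85, iterations=50):
    """Rank stems with TextRank, weighting co-occurrence edges by vector similarity.

    stems   -- list of stems in document order (output of a segmenter, assumed)
    vectors -- dict mapping each stem to its embedding (assumed pre-trained)
    """
    # Undirected graph: stems within `window` positions of each other are linked,
    # and the edge weight is their (non-negative) cosine similarity.
    weights = {}
    for i, s in enumerate(stems):
        for t in stems[i + 1 : i + 1 + window]:
            if s == t:
                continue
            sim = max(cosine(vectors[s], vectors[t]), 0.0)
            weights.setdefault(s, {})[t] = sim
            weights.setdefault(t, {})[s] = sim

    nodes = list(weights)
    scores = {n: 1.0 for n in nodes}
    out_sum = {n: sum(weights[n].values()) for n in nodes}

    # Standard weighted TextRank update, run for a fixed number of iterations.
    for _ in range(iterations):
        new_scores = {}
        for n in nodes:
            rank = sum(
                weights[m][n] / out_sum[m] * scores[m]
                for m in weights[n]
                if out_sum[m] > 0
            )
            new_scores[n] = (1 - damping) + damping * rank
        scores = new_scores

    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Toy input: placeholder stems and 2-dimensional vectors, for illustration only.
    toy_stems = ["stem_a", "stem_b", "stem_c", "stem_b", "stem_a"]
    toy_vectors = {
        "stem_a": [1.0, 0.0],
        "stem_b": [1.0, 0.0],
        "stem_c": [0.0, 1.0],
    }
    for stem, score in weighted_textrank(toy_stems, toy_vectors):
        print(stem, round(score, 3))
```

Under these assumptions, a stem that is semantically close to its neighbours receives more rank from them than in unweighted TextRank, which treats every co-occurrence edge equally.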
Pages: 17