Uyghur-Kazakh-Kirghiz Text Keyword Extraction Based on Morpheme Segmentation

Times cited: 1
Authors
Parhat, Sardar [1]
Sattar, Mutallip [1]
Hamdulla, Askar [2]
Kadir, Abdurahman [1]
Affiliations
[1] Xinjiang Univ Finance & Econ, Coll Informat Management, Urumqi 830012, Peoples R China
[2] Xinjiang Univ, Coll Informat Sci & Engn, Urumqi 830046, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Uyghur-Kazakh-Kirghiz; keyword extraction; morpheme segmentation; stem extraction; stem vector; TextRank;
DOI
10.3390/info14050283
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
This study presents a text keyword extraction method, built on a morpheme segmentation framework, for Uyghur, Kazakh and Kirghiz, three languages with similar grammatical and lexical structures. In these languages, a word is formed by joining affixes to a stem: the stem carries the notional meaning, while the affixes perform grammatical functions. Because of this derivational property, the vocabularies of these languages are very large, so pre-processing is a necessary step in NLP tasks for Uyghur, Kazakh and Kirghiz. Morpheme segmentation lets us remove the suffixes, which act as auxiliary units, while retaining the meaningful stem, and this reduces the dimension of the feature space in the keyword extraction task. We cast morpheme segmentation as a sequence labeling problem, used a Bi-LSTM network to capture the positional features of the character sequence in both directions, and applied a CRF to learn the dependencies between preceding and following labels. We used the resulting high-accuracy Bi-LSTM_CRF morpheme segmentation model to prepare morpheme-based experimental text sets. We then trained stem embedding vectors with the Doc2vec algorithm, used the similarity between stem vectors to modify the TextRank algorithm, and ran a text keyword extraction experiment. The highest F1 scores obtained on the three datasets were 43.8%, 44% and 43.9%. The experimental results show that the morpheme-based approach performs much better than the word-based approach, which indicates that stem vector similarity weighting is an effective method for text keyword extraction and supports the use of morpheme sequences for morphologically derivative languages.
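The abstract does not state how stem vector similarity enters TextRank, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that two stems are linked when they co-occur within a small window, and that each co-occurrence adds the (non-negative) cosine similarity of their vectors to the edge weight. The function names, the window size, and the toy stems and three-dimensional vectors are all hypothetical; in the paper the vectors come from Doc2vec training and the stems from the Bi-LSTM_CRF segmenter.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_edge_weights(stems, vectors, window=3):
    """Assumed weighting: each co-occurrence of two distinct stems inside the
    window adds their cosine similarity (clamped at 0) to a symmetric edge."""
    weights = defaultdict(float)
    for i, s in enumerate(stems):
        # A window of size w pairs a stem with the w - 1 stems that follow it.
        for t in stems[i + 1 : i + window]:
            if s == t:
                continue
            sim = max(cosine(vectors[s], vectors[t]), 0.0)
            weights[(s, t)] += sim
            weights[(t, s)] += sim
    return weights


def textrank_scores(stems, vectors, window=3, damping=0.85, iterations=50):
    """Weighted TextRank: a stem's score is (1 - d) plus d times the scores of
    its neighbours, each shared out in proportion to the edge weights."""
    weights = build_edge_weights(stems, vectors, window)
    nodes = sorted(set(stems))
    out_sum = defaultdict(float)      # total outgoing weight per stem
    incoming = defaultdict(list)      # (neighbour, weight) pairs per stem
    for (s, t), w in weights.items():
        if w > 0:
            out_sum[s] += w
            incoming[t].append((s, w))
    scores = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        scores = {
            n: (1 - damping)
            + damping * sum(scores[m] * w / out_sum[m] for m, w in incoming[n])
            for n in nodes
        }
    return scores


# Hypothetical toy input: a stem sequence and made-up stem vectors.
toy_stems = ["kitab", "oqu", "mektep", "kitab", "oqu", "alma"]
toy_vectors = {
    "kitab": [1.0, 0.2, 0.0],
    "oqu": [0.9, 0.3, 0.0],
    "mektep": [0.8, 0.5, 0.0],
    "alma": [0.0, 0.0, 1.0],
}

if __name__ == "__main__":
    ranked = sorted(textrank_scores(toy_stems, toy_vectors).items(),
                    key=lambda kv: kv[1], reverse=True)
    for stem, score in ranked:
        print(f"{stem}\t{score:.3f}")
```

In this toy input the vector for "alma" is orthogonal to the others, so all of its edges have zero weight and it keeps only the baseline score of 1 - damping. That illustrates the intended effect of similarity weighting: stems that co-occur with semantically unrelated neighbours gain little rank from them.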
Pages: 17