Tibetan Multi-word Expressions Identification Framework Based on News Corpora

Cited by: 2
Authors
Nuo, Minghua [1]
Lun, Congjun [2, 3]
Liu, Huidan [3]
Affiliations
[1] Inner Mongolia Univ, Coll Software Engn, Coll Comp Sci, Hohhot, Peoples R China
[2] Chinese Acad Social Sci, Inst Ethnol & Anthropol, Beijing, Peoples R China
[3] Chinese Acad Sci, Inst Software, Beijing, Peoples R China
Funding
National Science Foundation (United States);
Keywords
Tibetan Multi-word expression; Two-word coupling degree; Inside word probability;
DOI
10.1007/978-3-319-50496-4_2
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper presents a framework for identifying and extracting Tibetan multi-word expressions (MWEs). The framework has two phases. In the first phase, sentences are segmented and high-frequency word-based n-grams are extracted using Nagao's N-gram statistical algorithm and the Statistical Substring Reduction algorithm. In the second phase, Tibetan MWEs are identified by combining context analysis with language model-based analysis. The framework uses three strategies: context analysis, two-word coupling degree, and Tibetan syllable inside word probability. In the experiments, we evaluate the effectiveness of the three strategies on small test data and compare results at different granularities of context analysis. On the small test corpus, an F-score above 75% is achieved when words are segmented in pre-processing. On a larger corpus, P@N (N = 800) exceeds 85%, which indicates that the framework also works well on larger corpora. The experimental results show acceptable performance for Tibetan MWE identification.
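The first phase and the P@N metric mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: a plain `Counter` stands in for Nagao's N-gram algorithm (which works with sorted suffix pointers), the reduction step is a simplified reading of statistical substring reduction (drop an n-gram when a longer kept n-gram contains it with the same frequency), and the function names, the `min_freq` threshold, and the placeholder tokens are all hypothetical.

```python
from collections import Counter


def count_ngrams(sentences, min_n=2, max_n=4):
    """Count word-based n-grams over pre-segmented sentences (lists of words)."""
    counts = Counter()
    for words in sentences:
        for n in range(min_n, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts


def reduce_substrings(counts, min_freq=2):
    """Keep frequent n-grams, dropping any n-gram that occurs inside a longer
    frequent n-gram with exactly the same frequency (simplified reduction)."""
    frequent = {g: c for g, c in counts.items() if c >= min_freq}
    kept = {}
    for gram, freq in frequent.items():
        n = len(gram)
        subsumed = any(
            len(other) > n
            and other_freq == freq
            and any(other[i:i + n] == gram for i in range(len(other) - n + 1))
            for other, other_freq in frequent.items()
        )
        if not subsumed:
            kept[gram] = freq
    return kept


def precision_at_n(ranked_candidates, gold, n):
    """Fraction of the top-n ranked candidates that appear in the gold set."""
    top = ranked_candidates[:n]
    if not top:
        return 0.0
    return sum(1 for cand in top if cand in gold) / len(top)


# Placeholder tokens stand in for segmented Tibetan words.
sentences = [["a", "b", "c"], ["a", "b", "c"], ["a", "b", "d"]]
candidates = reduce_substrings(count_ngrams(sentences, 2, 3))
ranked = sorted(candidates, key=candidates.get, reverse=True)
```

In this toy input, `("b", "c")` occurs only inside `("a", "b", "c")` with the same count, so it is dropped, while `("a", "b")` survives because it also occurs elsewhere. The second-phase strategies (context analysis, coupling degree, inside word probability) are not modeled here because the abstract does not define them.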
Pages: 16-26
Number of pages: 11