Tibetan Multi-word Expressions Identification Framework Based on News Corpora

Cited by: 2
Authors
Nuo, Minghua [1]
Lun, Congjun [2, 3]
Liu, Huidan [3]
Affiliations
[1] Inner Mongolia Univ, Coll Software Engn, Coll Comp Sci, Hohhot, Peoples R China
[2] Chinese Acad Social Sci, Inst Ethnol & Anthropol, Beijing, Peoples R China
[3] Chinese Acad Sci, Inst Software, Beijing, Peoples R China
Source
NATURAL LANGUAGE UNDERSTANDING AND INTELLIGENT APPLICATIONS (NLPCC 2016) | 2016 / Vol. 10102
Funding
National Science Foundation (USA);
Keywords
Tibetan Multi-word expression; Two-word coupling degree; Inside word probability;
DOI
10.1007/978-3-319-50496-4_2
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
This paper presents a framework for identifying Tibetan multi-word expressions (MWEs). The framework has two phases. In the first phase, sentences are segmented and high-frequency word-based n-grams are extracted using Nagao's N-gram statistical algorithm and the Statistical Substring Reduction Algorithm. In the second phase, Tibetan MWEs are identified by combining context analysis with language model-based analysis. The framework uses three strategies: context analysis, two-word coupling degree, and Tibetan syllable inside-word probability. In the experiments, we evaluate the effectiveness of the three strategies on a small test set and compare results for context analysis at different granularities. On the small test corpus, an F-score above 75% is achieved when words are segmented in pre-processing. On a larger corpus, P@N (N = 800) exceeds 85%, which indicates that the framework also works well on larger corpora. The experimental results show acceptable performance for Tibetan MWE identification.
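The first phase described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: plain dictionary counting stands in for Nagao's N-gram algorithm, and the reduction step assumes the common formulation of statistical substring reduction, in which an n-gram is dropped when a longer n-gram containing it has the same frequency. The function names, the `max_n` and `min_freq` parameters, and the placeholder tokens are all hypothetical.

```python
from collections import Counter


def count_ngrams(sentences, max_n=4, min_freq=2):
    """Count word-based n-grams (2 <= n <= max_n) over segmented sentences.

    Assumption: simple Counter-based counting replaces Nagao's algorithm.
    Only n-grams seen at least min_freq times are kept.
    """
    counts = Counter()
    for words in sentences:
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_freq}


def reduce_substrings(ngrams):
    """Drop n-grams that are fully explained by a longer n-gram.

    Assumption: an n-gram is redundant when it is the prefix or suffix of a
    kept (n+1)-gram with exactly the same frequency.
    """
    redundant = set()
    for gram, freq in ngrams.items():
        for sub in (gram[:-1], gram[1:]):
            if len(sub) >= 2 and ngrams.get(sub) == freq:
                redundant.add(sub)
    return {gram: c for gram, c in ngrams.items() if gram not in redundant}


if __name__ == "__main__":
    # Placeholder tokens standing in for segmented Tibetan words.
    corpus = [
        ["a", "b", "c"],
        ["a", "b", "c"],
        ["b", "c", "d"],
        ["b", "c", "d"],
    ]
    candidates = reduce_substrings(count_ngrams(corpus, max_n=3))
    for gram, freq in sorted(candidates.items()):
        print(" ".join(gram), freq)
```

In this sketch the surviving n-grams would be the candidates passed to the second phase, where the abstract's three strategies decide which ones are actual MWEs.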
Pages: 16-26
Page count: 11