Tibetan Multi-word Expressions Identification Framework Based on News Corpora

Cited by: 2
Authors
Nuo, Minghua [1]
Lun, Congjun [2, 3]
Liu, Huidan [3]
Affiliations
[1] Inner Mongolia Univ, Coll Software Engn, Coll Comp Sci, Hohhot, Peoples R China
[2] Chinese Acad Social Sci, Inst Ethnol & Anthropol, Beijing, Peoples R China
[3] Chinese Acad Sci, Inst Software, Beijing, Peoples R China
Funding
National Science Foundation (USA);
Keywords
Tibetan Multi-word expression; Two-word coupling degree; Inside word probability;
DOI
10.1007/978-3-319-50496-4_2
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
This paper presents a framework for identifying Tibetan multi-word expressions (MWEs). The framework has two phases. In the first phase, sentences are segmented and high-frequency word-based n-grams are extracted using Nagao's N-gram statistical algorithm and the Statistical Substring Reduction Algorithm. In the second phase, Tibetan MWEs are identified by combining context analysis with language model-based analysis. The framework uses three strategies: context analysis, two-word coupling degree, and Tibetan syllable inside-word probability. In the experiments, we evaluate the effectiveness of the three strategies on a small test set and compare results at different granularities of context analysis. On the small test corpus, an F-score above 75% is achieved when words are segmented in pre-processing. On a larger corpus, P@N (N = 800) exceeds 85%, which indicates that the framework also works well on larger corpora. Overall, the experimental results reach acceptable performance for Tibetan MWE identification.
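The abstract reports two evaluation measures, F-score and P@N. As a minimal sketch of how these are commonly computed (this is not the authors' evaluation code; the function names and the toy candidate lists are hypothetical, and placeholder strings stand in for Tibetan MWE candidates):

```python
def precision_at_n(ranked_candidates, gold, n):
    """P@N: fraction of the top-n ranked candidates found in the gold set."""
    if n <= 0:
        raise ValueError("n must be positive")
    hits = sum(1 for cand in ranked_candidates[:n] if cand in gold)
    return hits / n


def f_score(predicted, gold):
    """Balanced F-score: harmonic mean of precision and recall."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): candidates ranked by some score, best first.
ranked = ["a b", "c d", "e f", "g h"]
gold = {"a b", "e f", "x y"}

p_at_2 = precision_at_n(ranked, gold, 2)  # 1 of the top 2 is in gold
f1 = f_score(ranked, gold)                # precision 2/4, recall 2/3
```

P@N needs only a ranked list and judgments for the top N items, which is why it suits large corpora where a full gold standard (required for recall, and hence F-score) is impractical.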
Pages: 16-26
Page count: 11