Tibetan Multi-word Expressions Identification Framework Based on News Corpora

Cited by: 2

Authors:

Nuo, Minghua [1]

Lun, Congjun [2,3]

Liu, Huidan [3]

Affiliations:

[1] Inner Mongolia Univ, Coll Software Engn, Coll Comp Sci, Hohhot, Peoples R China

[2] Chinese Acad Social Sci, Inst Ethnol & Anthropol, Beijing, Peoples R China

[3] Chinese Acad Sci, Inst Software, Beijing, Peoples R China

Source:

NATURAL LANGUAGE UNDERSTANDING AND INTELLIGENT APPLICATIONS (NLPCC 2016) | 2016 / Vol. 10102

Funding:

National Science Foundation (United States);

Keywords:

Tibetan Multi-word expression; Two-word coupling degree; Inside word probability;

DOI:

10.1007/978-3-319-50496-4_2

Chinese Library Classification number:

TP18 [Artificial Intelligence Theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This paper presents an identification framework for extracting Tibetan multi-word expressions (MWEs). The framework has two phases. In the first phase, sentences are segmented and high-frequency word-based n-grams are extracted using Nagao's n-gram statistical algorithm and the Statistical Substring Reduction algorithm. In the second phase, Tibetan MWEs are identified by combining context analysis with language-model-based analysis. The framework uses three strategies: context analysis, two-word coupling degree, and Tibetan syllable inside-word probability. In the experiments, we evaluate the effectiveness of the three strategies on a small test set and compare results at different granularities for context analysis. On the small test corpus, an F-score above 75% is achieved when words are segmented in pre-processing. On a larger corpus, P@N (N = 800) exceeds 85%, which indicates that the framework also works well on larger corpora. The experimental results show acceptable performance for Tibetan MWE identification.
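The first phase described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it counts word n-grams with a plain `Counter` rather than Nagao's sorting-based algorithm, and it applies a naive quadratic version of the basic substring-reduction rule (drop an n-gram when a longer n-gram contains it with the same frequency). The function names, the `min_freq` and `max_n` parameters, and the Latin placeholder tokens are all assumptions made for illustration.

```python
from collections import Counter


def count_ngrams(sentences, max_n=4):
    """Count every word n-gram (2 <= n <= max_n) in pre-segmented sentences."""
    counts = Counter()
    for words in sentences:
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts


def frequent_ngrams(counts, min_freq=2):
    """Keep only n-grams seen at least min_freq times (hypothetical threshold)."""
    return {gram: c for gram, c in counts.items() if c >= min_freq}


def contains(long_gram, short_gram):
    """True if short_gram occurs as a contiguous run inside long_gram."""
    k = len(short_gram)
    return any(long_gram[i:i + k] == short_gram
               for i in range(len(long_gram) - k + 1))


def reduce_substrings(ngrams):
    """Naive substring reduction: drop an n-gram that a longer n-gram
    contains with exactly the same frequency."""
    kept = {}
    for gram, freq in ngrams.items():
        absorbed = any(
            len(other) > len(gram) and other_freq == freq and contains(other, gram)
            for other, other_freq in ngrams.items()
        )
        if not absorbed:
            kept[gram] = freq
    return kept


# Placeholder tokens stand in for segmented Tibetan words.
sentences = [["a", "b", "c", "d"], ["a", "b", "c", "e"], ["x", "a", "b"]]
candidates = reduce_substrings(frequent_ngrams(count_ngrams(sentences)))
```

In this toy input, `("b", "c")` is dropped because it only ever appears inside `("a", "b", "c")`, while `("a", "b")` survives because it also occurs on its own. The surviving candidates would then go to the second-phase filters (context analysis, coupling degree, inside-word probability), which this sketch does not attempt to reproduce.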


Pages: 16-26

Page count: 11

