Tibetan Multi-word Expressions Identification Framework Based on News Corpora

Cited by: 2

Authors:

Nuo, Minghua [1]

Lun, Congjun [2,3]

Liu, Huidan [3]

Affiliations:

[1] Inner Mongolia Univ, Coll Software Engn, Coll Comp Sci, Hohhot, Peoples R China

[2] Chinese Acad Social Sci, Inst Ethnol & Anthropol, Beijing, Peoples R China

[3] Chinese Acad Sci, Inst Software, Beijing, Peoples R China

Source:

NATURAL LANGUAGE UNDERSTANDING AND INTELLIGENT APPLICATIONS (NLPCC 2016) | 2016 / Volume 10102

Funding:

US National Science Foundation;

Keywords:

Tibetan Multi-word expression; Two-word coupling degree; Inside word probability;

DOI:

10.1007/978-3-319-50496-4_2

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This paper presents a framework for identifying Tibetan multi-word expressions (MWEs). The framework has two phases. In the first phase, sentences are segmented and high-frequency word-based n-grams are extracted using Nagao's N-gram statistical algorithm and the Statistical Substring Reduction algorithm. In the second phase, Tibetan MWEs are identified by combining context analysis with language-model-based analysis. The framework uses three strategies: context analysis, two-word coupling degree, and the inside-word probability of Tibetan syllables. In the experiments, we evaluate the effectiveness of the three strategies on a small test set and compare context analysis at different granularities. On the small test corpus, an F-score above 75% is achieved when words are segmented in pre-processing. On a larger corpus, P@N (N = 800) exceeds 85%, which indicates that the framework also works well on larger corpora. Overall, the experimental results show acceptable performance for Tibetan MWE identification.
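
The first phase and the P@N metric mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it counts word-based n-grams with a plain `Counter` rather than Nagao's algorithm, and `reduce_substrings` is a simplified stand-in for statistical substring reduction that assumes an n-gram is redundant when a longer n-gram containing it has the same frequency. All function names, parameters, and the placeholder tokens are hypothetical.

```python
from collections import Counter


def count_ngrams(sentences, min_n=2, max_n=4):
    """Count word-based n-grams over pre-segmented sentences (lists of words)."""
    counts = Counter()
    for words in sentences:
        for n in range(min_n, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts


def frequent_ngrams(counts, min_freq=2):
    """Keep only n-grams seen at least min_freq times (threshold is an assumption)."""
    return {gram: c for gram, c in counts.items() if c >= min_freq}


def contains(longer, shorter):
    """True if the tuple `shorter` occurs contiguously inside the tuple `longer`."""
    k = len(shorter)
    return any(longer[i:i + k] == shorter for i in range(len(longer) - k + 1))


def reduce_substrings(ngrams):
    """Simplified substring reduction (assumed rule): drop an n-gram when a
    longer n-gram that contains it has exactly the same frequency."""
    kept = {}
    for gram, c in ngrams.items():
        redundant = any(
            len(other) > len(gram) and oc == c and contains(other, gram)
            for other, oc in ngrams.items()
        )
        if not redundant:
            kept[gram] = c
    return kept


def precision_at_n(ranked_candidates, gold, n):
    """P@N: fraction of the top-n ranked candidates that are in the gold set."""
    top = ranked_candidates[:n]
    if not top:
        return 0.0
    return sum(1 for cand in top if cand in gold) / len(top)


# Placeholder tokens stand in for segmented Tibetan words.
sentences = [["a", "b", "c"], ["a", "b", "c"], ["a", "b", "d"]]
candidates = reduce_substrings(
    frequent_ngrams(count_ngrams(sentences, min_n=2, max_n=3))
)
```

In this toy input, `("b", "c")` is dropped because it only ever occurs inside `("a", "b", "c")` with the same count, while `("a", "b")` survives because it is more frequent than any n-gram containing it. The surviving candidates would then be scored by the second-phase strategies, and a ranked list could be evaluated with `precision_at_n(ranked, gold, 800)`.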


Pages: 16-26

Page count: 11
