Tibetan Multi-word Expressions Identification Framework Based on News Corpora

Cited by: 2

Authors:

Nuo, Minghua [1]

Lun, Congjun [2,3]

Liu, Huidan [3]

Affiliations:

[1] Inner Mongolia Univ, Coll Software Engn, Coll Comp Sci, Hohhot, Peoples R China

[2] Chinese Acad Social Sci, Inst Ethnol & Anthropol, Beijing, Peoples R China

[3] Chinese Acad Sci, Inst Software, Beijing, Peoples R China

Source:

NATURAL LANGUAGE UNDERSTANDING AND INTELLIGENT APPLICATIONS (NLPCC 2016) | 2016 / Vol. 10102

Funding:

U.S. National Science Foundation;

Keywords:

Tibetan Multi-word expression; Two-word coupling degree; Inside word probability;

DOI:

10.1007/978-3-319-50496-4_2

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This paper presents an identification framework for extracting Tibetan multi-word expressions (MWEs). The framework has two phases. In the first phase, sentences are segmented and high-frequency word-based n-grams are extracted using Nagao's n-gram statistical algorithm and the Statistical Substring Reduction algorithm. In the second phase, Tibetan MWEs are identified by combining context analysis with language-model-based analysis. The framework uses three strategies: context analysis, two-word coupling degree, and Tibetan syllable inside-word probability. In the experiments, we evaluate the effectiveness of the three strategies on a small test set and compare results for different granularities of context analysis. On the small test corpus, an F-score above 75% is achieved when words are segmented in pre-processing. On a larger corpus, P@N (N = 800) exceeds 85%, which indicates that the framework also works well on larger corpora. Overall, the experimental results show acceptable performance for Tibetan MWE identification.
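
The first phase and the P@N metric mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it counts word n-grams with a plain `Counter` rather than Nagao's suffix-sorting algorithm, and it uses a simplified quadratic form of substring reduction that drops an n-gram when a longer n-gram of equal frequency contains it. The function names, the `min_freq` threshold, and the Latin placeholder tokens standing in for segmented Tibetan words are all assumptions made for illustration.

```python
from collections import Counter


def count_word_ngrams(sentences, min_n=2, max_n=4):
    """Count every word n-gram (min_n <= n <= max_n) in pre-segmented sentences."""
    counts = Counter()
    for words in sentences:
        for n in range(min_n, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts


def is_contiguous_sub(short, long):
    """Return True if tuple `short` occurs as a contiguous run inside `long`."""
    k = len(short)
    return any(long[i:i + k] == short for i in range(len(long) - k + 1))


def reduce_substrings(counts, min_freq=2):
    """Keep frequent n-grams, dropping those absorbed by a longer n-gram
    that has exactly the same frequency (simplified substring reduction)."""
    frequent = {g: c for g, c in counts.items() if c >= min_freq}
    kept = {}
    for gram, freq in frequent.items():
        absorbed = any(
            len(other) > len(gram)
            and other_freq == freq
            and is_contiguous_sub(gram, other)
            for other, other_freq in frequent.items()
        )
        if not absorbed:
            kept[gram] = freq
    return kept


def precision_at_n(ranked_candidates, gold, n):
    """Fraction of the top-n ranked candidates that appear in the gold set."""
    top = ranked_candidates[:n]
    return sum(1 for c in top if c in gold) / len(top) if top else 0.0


# Placeholder tokens; real input would be segmented Tibetan words.
sentences = [["a", "b", "c"], ["a", "b", "c"], ["x", "a", "b"]]
candidates = reduce_substrings(count_word_ngrams(sentences))
ranked = sorted(candidates, key=candidates.get, reverse=True)
```

In this toy input, the bigram `("b", "c")` only ever appears inside `("a", "b", "c")` with the same count, so it is dropped, while `("a", "b")` survives because it occurs more often than the trigram. The surviving candidates would then go to the second-phase strategies described in the abstract, which this sketch does not attempt to reproduce.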


Pages: 16-26

Page count: 11

