Tibetan Multi-word Expressions Identification Framework Based on News Corpora

Cited by: 2
Authors
Nuo, Minghua [1]
Lun, Congjun [2, 3]
Liu, Huidan [3]
Affiliations
[1] Inner Mongolia Univ, Coll Software Engn, Coll Comp Sci, Hohhot, Peoples R China
[2] Chinese Acad Social Sci, Inst Ethnol & Anthropol, Beijing, Peoples R China
[3] Chinese Acad Sci, Inst Software, Beijing, Peoples R China
Funding
National Science Foundation (USA);
Keywords
Tibetan Multi-word expression; Two-word coupling degree; Inside word probability;
DOI
10.1007/978-3-319-50496-4_2
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper presents an identification framework for extracting Tibetan multi-word expressions (MWEs). The framework has two phases. In the first phase, sentences are segmented and high-frequency word-based n-grams are extracted using Nagao's N-gram statistical algorithm and the Statistical Substring Reduction Algorithm. In the second phase, Tibetan MWEs are identified by the proposed framework, which is based on a combination of context analysis and language model-based analysis. Context analysis, two-word coupling degree, and Tibetan syllable inside word probability are the three strategies in the Tibetan MWE identification framework. In the experimental part, we evaluate the effectiveness of the three strategies on small test data and compare results at different granularities for context analysis. On the small test corpus, an F-score above 75% is achieved when words are segmented in pre-processing. On a larger corpus, P@N (N = 800) exceeds 85%, which indicates that the identification framework also works well on larger corpora. The experimental results show acceptable performance for Tibetan MWE identification.
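The first phase and the P@N metric mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes sentences are already segmented into word lists, replaces Nagao's sorted-pointer algorithm with a plain dictionary count, and uses a simplified form of substring reduction that drops an n-gram when a longer n-gram containing it has the same frequency. All function names, parameters, and the placeholder tokens are hypothetical.

```python
from collections import Counter


def count_ngrams(sentences, max_n=3, min_freq=2):
    """Count word-based n-grams (2 <= n <= max_n) and keep the frequent ones.

    Assumption: a plain Counter stands in for Nagao's N-gram algorithm.
    """
    counts = Counter()
    for words in sentences:
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return {gram: freq for gram, freq in counts.items() if freq >= min_freq}


def is_contiguous_sub(short, long):
    """Return True if `short` occurs as a contiguous run inside `long`."""
    k = len(short)
    return any(long[i:i + k] == short for i in range(len(long) - k + 1))


def reduce_substrings(ngrams):
    """Simplified substring reduction (assumed rule, not the paper's exact one).

    An n-gram is dropped when a longer n-gram contains it and has the same
    frequency, since the shorter one then never occurs on its own.
    """
    kept = {}
    for gram, freq in ngrams.items():
        absorbed = any(
            len(other) > len(gram)
            and other_freq == freq
            and is_contiguous_sub(gram, other)
            for other, other_freq in ngrams.items()
        )
        if not absorbed:
            kept[gram] = freq
    return kept


def precision_at_n(ranked_candidates, gold, n):
    """Fraction of the top-n ranked candidates that appear in the gold set."""
    top = ranked_candidates[:n]
    if not top:
        return 0.0
    return sum(1 for cand in top if cand in gold) / len(top)


# Placeholder tokens, not real Tibetan words.
sentences = [["a", "b", "c"], ["a", "b", "c"], ["a", "b", "d"]]
frequent = count_ngrams(sentences, max_n=3, min_freq=2)
candidates = reduce_substrings(frequent)
```

In this toy input, the bigram ("b", "c") is absorbed by the trigram ("a", "b", "c") because both occur twice, while ("a", "b") survives because it occurs three times. The surviving candidates would then be passed to the second-phase strategies, which this sketch does not cover.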
Pages: 16-26
Number of pages: 11