Challenges and Solutions for Consistent Annotation of Vietnamese Treebank

Authors
Nguyen, Quy T. [1, 2]
Miyao, Yusuke [1, 2]
Le, Ha T. T. [3]
Nguyen, Ngan L. T. [4]
Affiliations
[1] Grad Univ Adv Studies, Hayama, Kanagawa, Japan
[2] Natl Inst Informat, Tokyo, Japan
[3] Univ Social Sci & Humanities, Warsaw, Poland
[4] Univ Informat Technol, Ho Chi Minh City, Vietnam
Keywords
Vietnamese Treebank; Consistent Annotation; Challenges and Solutions
DOI
Not available
Chinese Library Classification
H [Language and Writing]
Discipline classification code
05
Abstract
Treebanks are important resources for research in natural language processing, speech recognition, and theoretical linguistics, among other fields. A Vietnamese treebank has already been built to strengthen automatic processing of the Vietnamese language. However, its quality is not satisfactory, which may be one cause of the low performance of Vietnamese language processing. We have been building a new treebank for Vietnamese with about 40,000 sentences annotated at three layers: word segmentation, part-of-speech tagging, and bracketing. In this paper, we describe several challenges posed by the Vietnamese language and how we address them in developing annotation guidelines. We also present our methods for improving the quality of the annotation guidelines and ensuring annotation accuracy and consistency. Experimental results show that inter-annotator agreement ratios and accuracy are higher than 90%, which is satisfactory.
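The abstract reports inter-annotator agreement ratios but does not say how they were computed. As a minimal sketch only, the Python below shows three common ways such agreement could be measured on two annotators' output: a raw agreement ratio over per-token labels, Cohen's kappa as a chance-corrected variant, and an F1 score over labeled brackets. The function names, the tag labels, and the `(label, start, end)` span encoding are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter


def agreement_ratio(tags_a, tags_b):
    """Fraction of token positions where both annotators chose the same label."""
    if len(tags_a) != len(tags_b):
        raise ValueError("both annotations must cover the same tokens")
    if not tags_a:
        raise ValueError("annotations must not be empty")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)


def cohen_kappa(tags_a, tags_b):
    """Chance-corrected agreement (Cohen's kappa) between two annotators."""
    n = len(tags_a)
    observed = agreement_ratio(tags_a, tags_b)
    count_a, count_b = Counter(tags_a), Counter(tags_b)
    # Expected agreement if each annotator labeled independently
    # according to their own label distribution.
    expected = sum(count_a[t] * count_b[t] for t in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def bracket_f1(spans_a, spans_b):
    """F1 overlap between two collections of (label, start, end) brackets."""
    set_a, set_b = set(spans_a), set(spans_b)
    common = len(set_a & set_b)
    if common == 0:
        return 0.0
    precision = common / len(set_b)
    recall = common / len(set_a)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations of one five-token sentence (labels are made up).
pos_a = ["N", "V", "N", "A", "N"]
pos_b = ["N", "V", "N", "N", "N"]
brackets_a = [("S", 0, 5), ("NP", 0, 1), ("VP", 1, 5), ("NP", 2, 5)]
brackets_b = [("S", 0, 5), ("NP", 0, 1), ("VP", 1, 5), ("NP", 2, 4)]

pos_agreement = agreement_ratio(pos_a, pos_b)
pos_kappa = cohen_kappa(pos_a, pos_b)
bracket_agreement = bracket_f1(brackets_a, brackets_b)
```

For the word segmentation layer, the same F1 computation could be applied to word boundary spans, again as an assumption rather than something the abstract specifies.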
Pages: 1532-1539
Page count: 8