Challenges and Solutions for Consistent Annotation of Vietnamese Treebank

Authors
Nguyen, Quy T. [1, 2]
Miyao, Yusuke [1, 2]
Le, Ha T. T. [3 ]
Nguyen, Ngan L. T. [4 ]
Affiliations
[1] Grad Univ Adv Studies, Hayama, Kanagawa, Japan
[2] Natl Inst Informat, Tokyo, Japan
[3] Univ Social Sci & Humanities, Warsaw, Poland
[4] Univ Informat Technol, Ho Chi Minh City, Vietnam
Keywords
Vietnamese Treebank; Consistent Annotation; Challenges and Solutions
DOI
Not available
Chinese Library Classification
H [Language and Writing]
Discipline classification code
05
Abstract
Treebanks are important resources for research in natural language processing, speech recognition, theoretical linguistics, and related fields. To strengthen the automatic processing of the Vietnamese language, a Vietnamese treebank has been built. However, the quality of this treebank is not satisfactory and is a possible cause of the low performance of Vietnamese language processing. We have been building a new treebank for Vietnamese with about 40,000 sentences annotated with three layers: word segmentation, part-of-speech tagging, and bracketing. In this paper, we describe several challenges posed by the Vietnamese language and how we address them in developing annotation guidelines. We also present our methods for improving the quality of the annotation guidelines and ensuring annotation accuracy and consistency. Experimental results show that inter-annotator agreement ratios and accuracy are higher than 90%, which is satisfactory.
Pages: 1532-1539
Page count: 8