Challenges and Solutions for Consistent Annotation of Vietnamese Treebank

Authors
Nguyen, Quy T. [1, 2]
Miyao, Yusuke [1, 2]
Le, Ha T. T. [3]
Nguyen, Ngan L. T. [4]
Affiliations
[1] Grad Univ Adv Studies, Hayama, Kanagawa, Japan
[2] Natl Inst Informat, Tokyo, Japan
[3] Univ Social Sci & Humanities, Warsaw, Poland
[4] Univ Informat Technol, Ho Chi Minh City, Vietnam
Keywords
Vietnamese Treebank; Consistent Annotation; Challenges and Solutions
DOI
Not available
Chinese Library Classification number
H [Language and Writing]
Discipline classification code
05
Abstract
Treebanks are important resources for research in areas such as natural language processing, speech recognition, and theoretical linguistics. To strengthen the automatic processing of the Vietnamese language, a Vietnamese treebank has already been built. However, the quality of this treebank is not satisfactory and is a possible cause of the low performance of Vietnamese language processing. We have been building a new treebank for Vietnamese with about 40,000 sentences annotated at three layers: word segmentation, part-of-speech tagging, and bracketing. In this paper, we describe several challenges posed by the Vietnamese language and how we address them in developing annotation guidelines. We also present our methods for improving the quality of the annotation guidelines and ensuring annotation accuracy and consistency. Experimental results show that inter-annotator agreement ratios and accuracy are both higher than 90%, which is satisfactory.
Pages: 1532-1539
Page count: 8