A web-based Bengali news corpus for named entity recognition

Cited by: 28
Authors
Ekbal, Asif [1]
Bandyopadhyay, Sivaji [1]
Affiliations
[1] Jadavpur Univ, Dept Comp Sci & Engn, Kolkata 700032, India
Keywords
web as corpus; news corpus; web-based tagged Bengali news corpus; named entity; named entity recognition
DOI
10.1007/s10579-008-9064-x
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
The rapid development of language resources and tools using machine learning techniques for less computerized languages requires appropriately tagged corpora. A tagged Bengali news corpus has been developed from the web archive of a widely read Bengali newspaper. A web crawler retrieves the web pages in Hyper Text Markup Language (HTML) format from the news archive. At present, the corpus contains approximately 34 million wordforms. Named Entity Recognition (NER) systems based on pattern-based shallow parsing, with or without linguistic knowledge, have been developed using a part of this corpus. The NER system that uses linguistic knowledge performed better, yielding the highest F-Score values of 75.40%, 72.30%, 71.37%, and 70.13% for person, location, organization, and miscellaneous names, respectively.
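The abstract reports per-category F-Scores but does not give the formula or the underlying counts. A minimal sketch, assuming the usual definition of the F-Score as the weighted harmonic mean of precision and recall, is shown below. The function name `f_score` and the counts in `example_counts` are hypothetical and are not taken from the paper.

```python
def f_score(true_positives: int, false_positives: int,
            false_negatives: int, beta: float = 1.0) -> float:
    """Return the F-beta score computed from entity-level counts.

    Assumption: the standard definition is used; the abstract does not
    say which beta or matching criterion the authors applied.
    """
    predicted = true_positives + false_positives   # entities the system tagged
    actual = true_positives + false_negatives      # entities in the gold data
    if true_positives == 0 or predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts per entity type: (TP, FP, FN). Illustrative only.
example_counts = {
    "person": (80, 20, 20),
    "location": (50, 50, 0),
}

for entity_type, (tp, fp, fn) in example_counts.items():
    print(f"{entity_type}: F = {100 * f_score(tp, fp, fn):.2f}%")
```

Scoring each entity type separately, as here, mirrors how the abstract lists one value per category rather than a single pooled figure.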
Pages: 173-182
Number of pages: 10