Multilingual news document clustering: Two algorithms based on cognate named entities

Times cited: 0
Authors
Montalvo, Soto [1]
Martinez, Raquel [1]
Casillas, Arantza [1]
Fresno, Victor [1]
Affiliation
[1] Univ Basque Country, EHU, Dept Electricidad & Electron, E-48080 Bilbao, Spain
Source
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper presents an approach to multilingual news document clustering in comparable corpora. We have implemented two heuristic algorithms that follow this approach. Their only evidence for clustering is the identification of cognate named entities between the two sides of the comparable corpus. In addition, the algorithms do not need to be given the correct number of clusters. The applicability of the approach depends only on whether cognate named entities can be identified between the languages in the corpus. The main difference between the two algorithms is whether or not a monolingual clustering phase is applied first. We have tested both algorithms on a comparable corpus of news written in English and Spanish. The two algorithms perform slightly differently; the one that does not apply the monolingual phase achieves better results. In any case, the results obtained with both algorithms are encouraging and show that cognate named entities can provide enough knowledge to deal with multilingual clustering of news documents.
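The abstract does not say how cognates are detected or how documents are grouped, so the following is only an illustrative sketch of the general idea, not the authors' algorithms. It assumes that named entities have already been extracted per document, that two entities count as cognates when their spelling similarity (here Python's `difflib.SequenceMatcher` ratio) reaches a threshold, and that documents sharing enough cognate entities are linked and then grouped as connected components. The names `is_cognate`, `shared_cognates`, `cluster`, the thresholds, and the toy data are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def is_cognate(a: str, b: str, threshold: float = 0.7) -> bool:
    """Assumed test: two entity strings are cognates if their spelling is similar enough."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def shared_cognates(ents_a, ents_b, threshold=0.7):
    """Count entities in ents_a that have at least one cognate in ents_b."""
    return sum(any(is_cognate(a, b, threshold) for b in ents_b) for a in ents_a)


def cluster(docs, min_shared=2, threshold=0.7):
    """Link documents that share enough cognate entities; return connected components."""
    parent = {d: d for d in docs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(docs, 2):
        if shared_cognates(docs[a], docs[b], threshold) >= min_shared:
            parent[find(a)] = find(b)

    groups = {}
    for d in docs:
        groups.setdefault(find(d), []).append(d)
    return sorted(sorted(g) for g in groups.values())


# Toy data (invented): named entities per document, English and Spanish.
docs = {
    "en1": ["Tony Blair", "London", "Iraq"],
    "es1": ["Tony Blair", "Londres", "Irak"],
    "en2": ["Real Madrid", "Barcelona"],
    "es2": ["Real Madrid", "Barcelona", "Zidane"],
}
clusters = cluster(docs)
print(clusters)  # [['en1', 'es1'], ['en2', 'es2']]
```

The number of clusters is never passed in; it falls out of which documents end up linked, which mirrors the property stated in the abstract. For simplicity the sketch compares all document pairs, whereas the paper's two variants differ in whether a monolingual clustering phase runs first.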
Pages: 165-172
Number of pages: 8
Related papers
41 in total
  • [1] Multilingual Document Clustering: an Heuristic Approach Based on Cognate Named Entities
    Montalvo, Soto
    Martinez, Raquel
    Casillas, Arantza
    Fresno, Victor
    COLING/ACL 2006, VOLS 1 AND 2, PROCEEDINGS OF THE CONFERENCE, 2006, : 1145 - 1152
  • [2] Multilingual news clustering: Feature translation vs. identification of cognate named entities
    Montalvo, S.
    Martinez, R.
    Casillas, A.
    Fresno, V.
    PATTERN RECOGNITION LETTERS, 2007, 28 (16) : 2305 - 2311
  • [3] Exploiting Named Entities for Bilingual News Clustering
    Montalvo, Soto
    Martinez, Raquel
    Fresno, Victor
    Delgado, Agustin
    JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY, 2015, 66 (02) : 363 - 376
  • [4] NESM: a Named Entity based Proximity Measure for Multilingual News Clustering
    Montalvo, Soto
    Fresno, Victor
    Martinez, Raquel
    PROCESAMIENTO DEL LENGUAJE NATURAL, 2012, (48) : 81 - 88
  • [5] Bilingual news clustering using named entities and fuzzy similarity
    Montalvo, Soto
    Martinez, Raquel
    Casillas, Arantza
    Fresno, Victor
    TEXT, SPEECH AND DIALOGUE, PROCEEDINGS, 2007, 4629 : 107 - 114
  • [6] Grouping business news stories based on salience of named entities
    Escoter, Llorenc
    Pivovarova, Lidia
    Du, Mian
    Katiskaya, Anisia
    Yangarber, Roman
    15TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2017), VOL 1: LONG PAPERS, 2017, : 1096 - 1106
  • [7] Fuzzy Named Entity-Based Document Clustering
    Cao, Tru H.
    Do, Hai T.
    Hong, Dung T.
    Quan, Tho T.
    2008 IEEE INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS, VOLS 1-5, 2008, : 2030 - 2036
  • [8] A Language-Independent Approach to Identify the Named Entities in Under-Resourced Languages and Clustering Multilingual Documents
    Kumar, N. Kiran
    Santosh, G. S. K.
    Varma, Vasudeva
    MULTILINGUAL AND MULTIMODAL INFORMATION ACCESS EVALUATION, 2011, 6941 : 74 - 82
  • [9] A Latent Semantic Indexing-based approach to multilingual document clustering
    Wei, Chih-Ping
    Yang, Christopher C.
    Lin, Chia-Min
    DECISION SUPPORT SYSTEMS, 2008, 45 (03) : 606 - 620
  • [10] Knowledge Discovery with CRF-Based Clustering of Named Entities without a Priori Classes
    Claveau, Vincent
    Ncibi, Abir
    COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, CICLING 2014, PT I, 2014, 8403 : 415 - 428