English-Welsh Cross-Lingual Embeddings

Cited by: 4
Authors
Espinosa-Anke, Luis [1]
Palmer, Geraint [2]
Corcoran, Padraig [1]
Filimonov, Maxim [1]
Spasic, Irena [1]
Knight, Dawn [3]
Affiliations
[1] Cardiff Univ, Sch Comp Sci & Informat, Cardiff CF24 3AA, Wales
[2] Cardiff Univ, Sch Math, Cardiff CF24 4AG, Wales
[3] Cardiff Univ, Sch English Commun & Philosophy, Cardiff CF10 3EU, Wales
Source
APPLIED SCIENCES-BASEL | 2021 / Vol. 11 / Issue 14
Keywords
natural language processing; distributional semantics; machine learning; language model; word embeddings; machine translation; sentiment analysis
DOI
10.3390/app11146541
Chinese Library Classification number
O6 [Chemistry];
Discipline classification code
0703;
Abstract
Cross-lingual embeddings are vector space representations where word translations tend to be co-located. These representations enable learning transfer across languages, thus bridging the gap between data-rich languages such as English and others. In this paper, we present and evaluate a suite of cross-lingual embeddings for the English-Welsh language pair. To train the bilingual embeddings, a Welsh corpus of approximately 145 M words was combined with an English Wikipedia corpus. We used a bilingual dictionary to frame the problem of learning bilingual mappings as a supervised machine learning task, where a word vector space is first learned independently on each monolingual corpus, after which a linear alignment strategy is applied to map the monolingual embeddings to a common bilingual vector space. Two approaches were used to learn monolingual embeddings: word2vec and fastText. Three cross-language alignment strategies were explored: cosine similarity, inverted softmax and cross-domain similarity local scaling (CSLS). We evaluated different combinations of these approaches using two tasks: bilingual dictionary induction and cross-lingual sentiment analysis. The best results were achieved using monolingual fastText embeddings and the CSLS metric. We also demonstrated that by including a few automatically translated training documents, the performance of a cross-lingual text classifier for Welsh can increase by approximately 20 percentage points.
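The pipeline summarized in the abstract (a linear map learned from dictionary pairs, followed by nearest-neighbour retrieval with cosine similarity or CSLS) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the linear map is fitted, so an orthogonal Procrustes solution is assumed here as one common choice. The data are random toy vectors rather than real English or Welsh embeddings, and all function and variable names (`learn_orthogonal_map`, `csls_scores`, `en`, `cy`) are hypothetical.

```python
import numpy as np


def normalize(M):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return M / np.linalg.norm(M, axis=1, keepdims=True)


def learn_orthogonal_map(X_src, Y_tgt):
    """Assumed alignment step: orthogonal Procrustes.

    Finds the orthogonal W minimizing ||X_src @ W - Y_tgt||_F, where row i of
    X_src and row i of Y_tgt are the vectors of one dictionary pair.
    """
    U, _, Vt = np.linalg.svd(X_src.T @ Y_tgt)
    return U @ Vt


def csls_scores(mapped_src, tgt, k=2):
    """CSLS: 2*cos(x, y) minus the mean similarity of x and of y to their
    k nearest neighbours in the other language (penalizes hub vectors)."""
    sims = mapped_src @ tgt.T  # cosine similarities (rows are unit length)
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    return 2 * sims - r_src - r_tgt


# Toy data: the "Welsh" space is a hidden rotation Q of the "English" space,
# so word i in one space translates to word i in the other.
rng = np.random.default_rng(0)
dim, n_words = 8, 20
en = normalize(rng.normal(size=(n_words, dim)))
Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
cy = normalize(en @ Q)

# Use the first 10 word pairs as the bilingual dictionary (supervision).
train = np.arange(10)
W = learn_orthogonal_map(en[train], cy[train])

# Bilingual dictionary induction: map every English vector, then retrieve
# the best-scoring Welsh vector under each similarity measure.
mapped = normalize(en @ W)
cosine_pred = (mapped @ cy.T).argmax(axis=1)
csls_pred = csls_scores(mapped, cy).argmax(axis=1)
```

In this idealized setting the two spaces differ only by a rotation, so both retrieval rules recover every translation, including the ten words held out of the dictionary. On real embeddings the spaces are only approximately isomorphic, which is where hubness-aware scores such as CSLS or inverted softmax are expected to matter.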
Pages: 15