English-Welsh Cross-Lingual Embeddings

Cited by: 4
Authors
Espinosa-Anke, Luis [1 ]
Palmer, Geraint [2 ]
Corcoran, Padraig [1 ]
Filimonov, Maxim [1 ]
Spasic, Irena [1 ]
Knight, Dawn [3 ]
Affiliations
[1] Cardiff Univ, Sch Comp Sci & Informat, Cardiff CF24 3AA, Wales
[2] Cardiff Univ, Sch Math, Cardiff CF24 4AG, Wales
[3] Cardiff Univ, Sch English Commun & Philosophy, Cardiff CF10 3EU, Wales
Source
APPLIED SCIENCES-BASEL | 2021, Vol. 11, Issue 14
Keywords
natural language processing; distributional semantics; machine learning; language model; word embeddings; machine translation; sentiment analysis
DOI
10.3390/app11146541
CLC classification
O6 [Chemistry]
Subject classification code
0703
Abstract
Cross-lingual embeddings are vector space representations in which word translations tend to be co-located. These representations enable learning transfer across languages, thus bridging the gap between data-rich languages such as English and others. In this paper, we present and evaluate a suite of cross-lingual embeddings for the English-Welsh language pair. To train the bilingual embeddings, a Welsh corpus of approximately 145 M words was combined with an English Wikipedia corpus. We used a bilingual dictionary to frame the problem of learning bilingual mappings as a supervised machine learning task, where a word vector space is first learned independently on a monolingual corpus, after which a linear alignment strategy is applied to map the monolingual embeddings to a common bilingual vector space. Two approaches were used to learn monolingual embeddings, namely word2vec and fastText. Three cross-language alignment strategies were explored: cosine similarity, inverted softmax and cross-domain similarity local scaling (CSLS). We evaluated different combinations of these approaches using two tasks: bilingual dictionary induction and cross-lingual sentiment analysis. The best results were achieved using monolingual fastText embeddings and the CSLS metric. We also demonstrated that by including a few automatically translated training documents, the performance of a cross-lingual text classifier for Welsh can increase by approximately 20 percentage points.
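The supervised pipeline the abstract describes (a linear map between monolingual spaces learned from a seed bilingual dictionary, with CSLS used for translation retrieval) can be sketched roughly as follows. This is an illustrative sketch, not the authors' code: the function names and toy dimensions are assumptions, and the linear alignment is shown in its common orthogonal-Procrustes form.

```python
import numpy as np

def procrustes_align(X, Y):
    """Learn an orthogonal map W minimising ||XW - Y||_F, where rows
    X[i] and Y[i] are the embeddings of a seed-dictionary translation
    pair. Solved in closed form via SVD (orthogonal Procrustes)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def csls_scores(src_mapped, tgt, k=10):
    """Cross-domain similarity local scaling (CSLS): cosine similarity
    discounted by the mean similarity of each vector's k nearest
    cross-lingual neighbours, which penalises 'hub' words."""
    # normalise rows so that dot products equal cosine similarities
    s = src_mapped / np.linalg.norm(src_mapped, axis=1, keepdims=True)
    t = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = s @ t.T                      # pairwise cosine similarities
    # mean similarity to the k nearest neighbours, on each side
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    return 2.0 * sims - r_src - r_tgt   # higher = better candidate
```

Under this sketch, bilingual dictionary induction amounts to mapping the source vectors with `W` and taking `csls_scores(src @ W, tgt).argmax(axis=1)` as the predicted translations.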
Pages: 15