Massively Multilingual Pronunciation Mining with WikiPron

Citations: 0
Authors
Lee, Jackson L.
Ashby, Lucas F. E. [1]
Garza, M. Elizabeth [1]
Lee-Sikka, Yeonju [1]
Miller, Sean [1]
Wong, Alan [1]
McCarthy, Arya D. [2]
Gorman, Kyle [1]
Affiliations
[1] CUNY, Grad Ctr, New York, NY 10021 USA
[2] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
Keywords
speech; pronunciation; grapheme-to-phoneme; g2p; MODELS;
DOI
Not available
Chinese Library Classification
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
We introduce WikiPron, an open-source command-line tool for extracting pronunciation data from Wiktionary, a collaborative multilingual online dictionary. We first describe the design and use of WikiPron. We then discuss the challenges faced in scaling this tool to create an automatically generated database of 1.7 million pronunciations from 165 languages. Finally, we validate the pronunciation database by using it to train and evaluate a collection of generic grapheme-to-phoneme models. The software, pronunciation data, and models are all made available under permissive open-source licenses.
Pages: 4223-4228
Page count: 6