Massively Multilingual Pronunciation Mining with WikiPron

Authors
Lee, Jackson L.
Ashby, Lucas F. E. [1]
Garza, M. Elizabeth [1]
Lee-Sikka, Yeonju [1]
Miller, Sean [1]
Wong, Alan [1]
McCarthy, Arya D. [2]
Gorman, Kyle [1]
Affiliations
[1] CUNY, Grad Ctr, New York, NY 10021 USA
[2] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
Source
PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020) | 2020
Keywords
speech; pronunciation; grapheme-to-phoneme; g2p; models
DOI
Not available
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Subject classification codes
081203; 0835
Abstract
We introduce WikiPron, an open-source command-line tool for extracting pronunciation data from Wiktionary, a collaborative multilingual online dictionary. We first describe the design and use of WikiPron. We then discuss the challenges faced in scaling this tool to create an automatically generated database of 1.7 million pronunciations from 165 languages. Finally, we validate the pronunciation database by using it to train and evaluate a collection of generic grapheme-to-phoneme models. The software, pronunciation data, and models are all made available under permissive open-source licenses.
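The validation step mentioned in the abstract can be pictured with a small sketch. It assumes pronunciation data laid out as two tab-separated columns (a word, then its space-separated phones) and scores predictions by word error rate. The abstract specifies neither the file layout nor the metric, so both are assumptions, as are the helper names `parse_entries` and `word_error_rate` and the toy entries, which are illustrative and not taken from the database.

```python
from typing import Dict, List


def parse_entries(tsv_text: str) -> Dict[str, List[str]]:
    """Parse 'word<TAB>phone phone ...' lines into {word: [phones]}.

    Assumed layout (hypothetical for this sketch): one entry per line,
    two tab-separated columns. If a word repeats, the last entry wins.
    """
    entries: Dict[str, List[str]] = {}
    for line in tsv_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, pron = line.split("\t", 1)
        entries[word] = pron.split()
    return entries


def word_error_rate(gold: Dict[str, List[str]],
                    hypo: Dict[str, List[str]]) -> float:
    """Fraction of gold words whose predicted phones are not an exact match."""
    if not gold:
        raise ValueError("gold set is empty")
    errors = sum(1 for word, phones in gold.items()
                 if hypo.get(word) != phones)
    return errors / len(gold)


# Toy data for illustration only; not drawn from the real database.
GOLD = "cat\tk æ t\ndog\td ɒ ɡ\nfish\tf ɪ ʃ\nbird\tb ɜː d\n"
HYPO = "cat\tk æ t\ndog\td ɔ ɡ\nfish\tf ɪ ʃ\nbird\tb ɜː d\n"

gold = parse_entries(GOLD)
hypo = parse_entries(HYPO)
wer = word_error_rate(gold, hypo)
print(f"WER: {wer:.2f}")  # one of four toy words differs
```

Exact-match scoring is the simplest choice here; a phone-level edit distance would give partial credit for near misses, at the cost of a longer example.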
Pages: 4223-4228
Page count: 6