Speaker dependent articulatory-to-acoustic mapping using real-time MRI of the vocal tract

Cited by: 4

Authors:

Csapo, Tamas Gabor [1,2]

Affiliations:

[1] Budapest Univ Technol & Econ, Dept Telecommun & Media Informat, Budapest, Hungary

[2] MTA ELTE Lendulet Lingual Articulat Res Grp, Budapest, Hungary

Source:

INTERSPEECH 2020 | 2020

Keywords:

magnetic resonance imaging; articulatory-to-acoustic mapping; vocal tract; deep neural network; SPEECH RECOGNITION; DATABASE;

DOI:

10.21437/Interspeech.2020-15

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology];

Discipline classification code:

100104 ; 100213 ;

Abstract:

Articulatory-to-acoustic (forward) mapping is a technique to predict speech from data captured by various articulatory acquisition techniques (e.g. ultrasound tongue imaging, lip video). Real-time MRI (rtMRI) of the vocal tract has not been used before for this purpose. The advantage of MRI is that it has a high 'relative' spatial resolution: it can capture not only lingual, labial and jaw motion, but also the velum and the pharyngeal region, which is typically not possible with other techniques. In the current paper, we train various DNNs (fully connected, convolutional and recurrent neural networks) for articulatory-to-speech conversion, using rtMRI as input, in a speaker-specific way. We use two male and two female speakers of the USC-TIMIT articulatory database, each of them uttering 460 sentences. We evaluate the results with objective measures (Normalized MSE and MCD) and a subjective measure (perceptual test), and show that CNN-LSTM networks that take multiple images as input are preferred, achieving MCD scores between 2.8 and 4.5 dB. In the experiments, we find that the predictions for speaker 'm1' are significantly weaker than those for the other speakers. We show that this is caused by the fact that 74% of the recordings of speaker 'm1' are out of sync.
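For readers unfamiliar with the MCD (mel-cepstral distortion) figure quoted in the abstract, the following is a minimal sketch of the commonly used definition, in which the per-frame distortion is (10 / ln 10) · sqrt(2 · Σ (c_d − ĉ_d)²) and the result is averaged over frames. The abstract does not state the cepstral order, whether the energy coefficient c0 is excluded, or how frames are aligned, so the function name `mel_cepstral_distortion`, the `skip_c0` flag, and the assumption of already time-aligned frames are illustrative choices rather than the paper's setup.

```python
import math


def mel_cepstral_distortion(ref, pred, skip_c0=True):
    """Mean MCD in dB between two sequences of mel-cepstral frames.

    ref, pred: lists of frames, each frame a list of cepstral coefficients.
    Assumption: frames are already time-aligned, one-to-one.
    skip_c0: if True, ignore the 0th (energy) coefficient, a common convention.
    """
    if not ref or len(ref) != len(pred):
        raise ValueError("ref and pred must be non-empty and equally long")
    start = 1 if skip_c0 else 0
    scale = 10.0 / math.log(10.0)  # converts the distance to decibels
    total = 0.0
    for r, p in zip(ref, pred):
        if len(r) != len(p):
            raise ValueError("frames must have the same number of coefficients")
        squared = sum((a - b) ** 2 for a, b in zip(r[start:], p[start:]))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(ref)


if __name__ == "__main__":
    reference = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    predicted = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
    print(round(mel_cepstral_distortion(reference, predicted), 3))
```

Lower values indicate predicted spectra closer to the reference; a value of 0 means the compared coefficients are identical.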


Pages: 2722-2726

Number of pages: 5
