Transformer-Based Speech Synthesizer Attribution in an Open Set Scenario

Cited by: 5
Authors
Bartusiak, Emily R. [1 ]
Delp, Edward J. [1 ]
Affiliations
[1] Purdue Univ, Sch Elect & Comp Engn, Video & Image Proc Lab, W Lafayette, IN 47907 USA
Source
2022 21ST IEEE INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS, ICMLA | 2022
Keywords
machine learning; deep learning; audio forensics; media forensics; speech synthesizer attribution; open set; spectrogram; transformer; convolutional transformer; tSNE;
DOI
10.1109/ICMLA55696.2022.00054
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Speech synthesis methods can create realistic-sounding speech, which may be used for fraud, spoofing, and misinformation campaigns. Forensic methods that detect synthesized speech are important for protection against such attacks. Forensic attribution methods provide even more information about the nature of synthesized speech signals because they identify the specific speech synthesis method (i.e., speech synthesizer) used to create a speech signal. Due to the increasing number of realistic-sounding speech synthesizers, we propose a speech attribution method that generalizes to new synthesizers not seen during training. To do so, we investigate speech synthesizer attribution in both a closed set scenario and an open set scenario. In other words, we consider some speech synthesizers to be "known" synthesizers (i.e., part of the closed set) and others to be "unknown" synthesizers (i.e., part of the open set). We represent speech signals as spectrograms and train our proposed method, known as the compact attribution transformer (CAT), on the closed set for multi-class classification. Then, we extend our analysis to the open set to attribute synthesized speech signals to both known and unknown synthesizers. We utilize a t-distributed stochastic neighbor embedding (tSNE) on the latent space of the trained CAT to differentiate between the unknown synthesizers. Additionally, we explore poly-1 loss formulations to improve attribution results. Our proposed approach successfully attributes synthesized speech signals to their respective speech synthesizers in both closed and open set scenarios.
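The abstract mentions poly-1 loss formulations. The general poly-1 idea (from the PolyLoss family) augments standard cross-entropy with a first-order polynomial term weighted by a coefficient epsilon; the sketch below is a minimal NumPy illustration of that general formulation, not the paper's exact training setup, and the function names and epsilon value are illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def poly1_cross_entropy(logits, targets, epsilon=1.0):
    """Poly-1 loss: cross-entropy plus epsilon * (1 - p_t), where p_t is
    the predicted probability of the target class. With epsilon = 0 this
    reduces to plain cross-entropy."""
    probs = softmax(logits)
    p_t = probs[np.arange(len(targets)), targets]
    ce = -np.log(p_t)
    return float((ce + epsilon * (1.0 - p_t)).mean())
```

Because the added term `epsilon * (1 - p_t)` is non-negative and shrinks as the target-class probability approaches 1, a positive epsilon penalizes low-confidence correct predictions more strongly than plain cross-entropy does.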
Pages: 329-336
Page count: 8