Transformer-Based Speech Synthesizer Attribution in an Open Set Scenario

Cited by: 5
Authors
Bartusiak, Emily R. [1 ]
Delp, Edward J. [1 ]
Institutions
[1] Purdue Univ, Sch Elect & Comp Engn, Video & Image Proc Lab, W Lafayette, IN 47907 USA
Source
2022 21ST IEEE INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS, ICMLA | 2022
Keywords
machine learning; deep learning; audio forensics; media forensics; speech synthesizer attribution; open set; spectrogram; transformer; convolutional transformer; tSNE;
D O I
10.1109/ICMLA55696.2022.00054
CLC classification number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Speech synthesis methods can create realistic-sounding speech, which may be used for fraud, spoofing, and misinformation campaigns. Forensic methods that detect synthesized speech are important for protection against such attacks. Forensic attribution methods provide even more information about the nature of synthesized speech signals because they identify the specific speech synthesis method (i.e., speech synthesizer) used to create a speech signal. Due to the increasing number of realistic-sounding speech synthesizers, we propose a speech attribution method that generalizes to new synthesizers not seen during training. To do so, we investigate speech synthesizer attribution in both a closed set scenario and an open set scenario. In other words, we consider some speech synthesizers to be "known" synthesizers (i.e., part of the closed set) and others to be "unknown" synthesizers (i.e., part of the open set). We represent speech signals as spectrograms and train our proposed method, known as the compact attribution transformer (CAT), on the closed set for multi-class classification. Then, we extend our analysis to the open set to attribute synthesized speech signals to both known and unknown synthesizers. We utilize a t-distributed stochastic neighbor embedding (tSNE) on the latent space of the trained CAT to differentiate between unknown synthesizers. Additionally, we explore poly-1 loss formulations to improve attribution results. Our proposed approach successfully attributes synthesized speech signals to their respective speech synthesizers in both closed and open set scenarios.
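The poly-1 loss the abstract refers to (from the PolyLoss family) augments standard cross-entropy with a weighted first-order polynomial term, epsilon * (1 - p_t), where p_t is the predicted probability of the true class. A minimal single-example NumPy sketch of that formulation, under the assumption that the paper uses the standard Poly-1 definition (its exact epsilon and batching are not given here):

```python
import numpy as np

def poly1_cross_entropy(logits, target, epsilon=1.0):
    """Poly-1 loss for one example: cross-entropy plus an
    epsilon-weighted (1 - p_t) term, where p_t is the softmax
    probability assigned to the true class."""
    # Softmax with max-subtraction for numerical stability
    z = logits - logits.max()
    probs = np.exp(z) / np.exp(z).sum()
    pt = probs[target]
    ce = -np.log(pt)          # standard cross-entropy term
    return ce + epsilon * (1.0 - pt)
```

With epsilon = 0 this reduces to plain cross-entropy; increasing epsilon up-weights examples the model is unsure about (small p_t), which is the knob the paper tunes for attribution.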
Pages: 329 - 336
Page count: 8
Related Papers
50 items total
  • [31] Multiformer: A Head-Configurable Transformer-Based Model for Direct Speech Translation
    Sant, Gerard
    Gállego, Gerard I.
    Alastruey, Belen
    Costa-Jussà, Marta R.
    NAACL 2022 - 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Student Research Workshop, 2022, : 277 - 284
  • [33] Transformer-based Long-context End-to-end Speech Recognition
    Hori, Takaaki
    Moritz, Niko
    Hori, Chiori
    Le Roux, Jonathan
    INTERSPEECH 2020, 2020, : 5011 - 5015
  • [34] Transformer-based neural speech decoding from surface and depth electrode signals
    Chen, Junbo
    Chen, Xupeng
    Wang, Ran
    Le, Chenqian
    Khalilian-Gourtani, Amirhossein
    Jensen, Erika
    Dugan, Patricia
    Doyle, Werner
    Devinsky, Orrin
    Friedman, Daniel
    Flinker, Adeen
    Wang, Yao
    JOURNAL OF NEURAL ENGINEERING, 2025, 22 (01)
  • [35] ScaleFormer: Transformer-based speech enhancement in the multi-scale time domain
    Wu, Tianci
    He, Shulin
    Zhang, Hui
    Zhang, XueLiang
    2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC, 2023, : 2448 - 2453
  • [36] On-device Streaming Transformer-based End-to-End Speech Recognition
    Oh, Yoo Rhee
    Park, Kiyoung
    INTERSPEECH 2021, 2021, : 967 - 968
  • [37] Efficient Transformer-based Speech Enhancement Using Long Frames and STFT Magnitudes
    de Oliveira, Danilo
    Peer, Tal
    Gerkmann, Timo
    INTERSPEECH 2022, 2022, : 2948 - 2952
  • [38] An Investigation of Positional Encoding in Transformer-based End-to-end Speech Recognition
    Yue, Fengpeng
    Ko, Tom
    2021 12TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2021,
  • [39] Transformer-Based Learned Optimization
    Gartner, Erik
    Metz, Luke
    Andriluka, Mykhaylo
    Freeman, C. Daniel
    Sminchisescu, Cristian
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 11970 - 11979
  • [40] Transformer-based Image Compression
    Lu, Ming
    Guo, Peiyao
    Shi, Huiqing
    Cao, Chuntong
    Ma, Zhan
    DCC 2022: 2022 DATA COMPRESSION CONFERENCE (DCC), 2022, : 469 - 469