STFF-SM: Steganalysis Model Based on Spatial and Temporal Feature Fusion for Speech Streams

Cited by: 4
Authors
Tian, Hui [1, 2]
Qiu, Yiqin [1, 2]
Mazurczyk, Wojciech [3]
Li, Haizhou [4, 5]
Qian, Zhenxing [6]
Affiliations
[1] Natl Huaqiao Univ, Coll Comp Sci & Technol, Xiamen 361021, Peoples R China
[2] Xiamen Key Lab Data Secur & Blockchain Technol, Xiamen 361021, Peoples R China
[3] Warsaw Univ Technol, Fac Elect & Informat Technol, Inst Comp Sci, PL-00665 Warsaw, Poland
[4] Chinese Univ Hong Kong, Sch Data Sci, Shenzhen 518172, Peoples R China
[5] Natl Univ Singapore, Dept Elect & Comp Engn, Singapore 119077, Singapore
[6] Fudan Univ, Sch Comp Sci, Shanghai 200433, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Delays; Feature extraction; Steganography; Quantization (signal); Distortion; Speech coding; Resistance; Steganalysis; steganography; voice over Internet protocol; speech streams; deep neural networks; pitch delays; STEGANOGRAPHY; SCHEME; VOICE
DOI
10.1109/TASLP.2022.3224295
Chinese Library Classification (CLC) number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Real-time detection of speech steganography in Voice-over-Internet-Protocol (VoIP) scenarios remains an open problem, because it requires steganalysis methods to perform well on low-intensity embeddings and short-sample inputs while also delivering rapid detection results. To address these challenges, this paper presents a novel steganalysis model based on spatial and temporal feature fusion (STFF-SM). Unlike existing methods, we take both the integer and fractional pitch delays as input, and design a subframe-stitch module to integrate subframe-wise integer delays and frame-wise fractional pitch delays. Further, we design a spatial fusion module based on pre-activation residual convolution that extracts pitch spatial features and gradually increases their dimensions to expose finer steganographic distortions and thereby improve detection; a Group-Squeeze-Weighting block is introduced to alleviate the information loss incurred when increasing the feature dimension. In addition, we design a temporal fusion module that extracts pitch temporal features using a stacked LSTM, where a Gated Feed-Forward Network is introduced to learn the interaction between different feature maps while suppressing features that are not useful for detection. We evaluate STFF-SM through comprehensive experiments and comparisons with state-of-the-art solutions. The experimental results demonstrate that STFF-SM meets the needs of real-time detection of speech steganography in VoIP streams and outperforms existing methods in detection performance, especially at low embedding strengths and short window sizes.
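The abstract names integer and fractional pitch delays as the model input and mentions short detection windows, but it does not give the data layout. The following is only a minimal sketch of how such an input could be arranged, not the paper's subframe-stitch module. It assumes four subframes per frame, assumes both delay types arrive as flat per-subframe sequences, and uses hypothetical helper names (`stitch_pitch_delays`, `sliding_windows`).

```python
from typing import Iterator, List, Sequence

# Assumption: four subframes per frame (AMR, for example, splits each
# 20 ms frame into four subframes). The abstract does not state this.
SUBFRAMES_PER_FRAME = 4


def stitch_pitch_delays(
    integer_delays: Sequence[int],
    fractional_delays: Sequence[int],
    subframes_per_frame: int = SUBFRAMES_PER_FRAME,
) -> List[List[int]]:
    """Group flat per-subframe delays into one feature row per frame.

    Each row holds the frame's integer delays followed by its fractional
    delays. This layout is hypothetical and chosen only for illustration.
    """
    if len(integer_delays) != len(fractional_delays):
        raise ValueError("delay sequences must have equal length")
    if len(integer_delays) % subframes_per_frame != 0:
        raise ValueError("input must contain whole frames")

    rows: List[List[int]] = []
    for start in range(0, len(integer_delays), subframes_per_frame):
        end = start + subframes_per_frame
        rows.append(
            list(integer_delays[start:end]) + list(fractional_delays[start:end])
        )
    return rows


def sliding_windows(
    rows: Sequence[List[int]], window_frames: int, hop_frames: int = 1
) -> Iterator[List[List[int]]]:
    """Yield consecutive windows of `window_frames` rows each."""
    if window_frames <= 0 or hop_frames <= 0:
        raise ValueError("window_frames and hop_frames must be positive")
    for start in range(0, len(rows) - window_frames + 1, hop_frames):
        yield list(rows[start:start + window_frames])


# Usage: two frames of made-up delay values.
ints = [40, 41, 40, 42, 55, 54, 55, 56]
fracs = [0, 1, 2, 0, 3, 0, 1, 2]
rows = stitch_pitch_delays(ints, fracs)
windows = list(sliding_windows(rows, window_frames=2))
```

Each window produced this way would be one short-sample input to a detector; a smaller `window_frames` corresponds to the short window sizes the abstract highlights.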
Pages: 277-289
Page count: 13