STFT-TCNN
Use case : speech enhancement
Model description
The TCNN is a temporal convolutional model for time-domain speech enhancement, proposed in 2019 by Pandey and Wang in the paper "TCNN: Temporal Convolutional Neural Network for Real-Time Speech Enhancement in the Time Domain".
Unfortunately, time-domain models do not perform well when quantized to 8-bit integer precision. We therefore modified the model to work in the frequency domain: we removed the convolutional encoder and decoder described in the paper, kept only the main TCN part of the model, and replaced the encoder/decoder pair with STFT pre-processing and inverse STFT post-processing.
This means that the model takes as input magnitude spectrogram frames, and outputs a mask of the same dimension. Inference is then performed by applying this output mask to the complex spectrogram corresponding to the input, and performing inverse STFT on the masked complex spectrogram to retrieve the corresponding time domain denoised signal.
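The masking pipeline described above can be sketched in NumPy as follows. This is a minimal illustration, not the model zoo's implementation: the hop length of 128, the Hann window, the reflect padding and the `mask_fn` callable (a stand-in for the actual model) are all assumptions made for the example; only `n_fft = 512` comes from this document.

```python
import numpy as np

N_FFT = 512
HOP = 128  # assumption: the hop length is not stated in this document


def stft(x, n_fft=N_FFT, hop=HOP):
    """Complex spectrogram of shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann window
    x = np.pad(x, n_fft // 2, mode="reflect")
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop:i * hop + n_fft] * window for i in range(n_frames)], axis=1
    )
    return np.fft.rfft(frames, axis=0)


def istft(spec, length, n_fft=N_FFT, hop=HOP):
    """Weighted overlap-add inverse of `stft`, trimmed to `length` samples."""
    window = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(spec, n=n_fft, axis=0) * window[:, None]
    n_frames = spec.shape[1]
    out = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    for i in range(n_frames):
        out[i * hop:i * hop + n_fft] += frames[:, i]
        norm[i * hop:i * hop + n_fft] += window ** 2
    out /= np.maximum(norm, 1e-8)
    return out[n_fft // 2:n_fft // 2 + length]


def denoise(noisy, mask_fn):
    """Apply a magnitude-domain mask to the complex spectrogram of `noisy`."""
    spec = stft(noisy)                           # complex, (257, T)
    mag = np.abs(spec)[None].astype(np.float32)  # model input, (1, 257, T)
    mask = mask_fn(mag)[0]                       # model output, (257, T)
    return istft(spec * mask, len(noisy))        # masked complex spectrogram
```

With a hypothetical `mask_fn` that returns all ones, `denoise` reconstructs its input up to floating-point error, which is a convenient sanity check before plugging in a real model.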
Network information
STFT-TCNN
Network Information | Value |
---|---|
Framework | ONNX QDQ |
Parameters | 1.594 M |
Quantization | int8 |
Network inputs / outputs
The model expects input of shape (batch, n_fft // 2 + 1, sequence_length), where batch and sequence_length are axes with dynamic length. The models provided here were pretrained with n_fft = 512, so the input shape becomes (batch, 257, sequence_length).
We provide float and int8 quantized ONNX versions of the model, as well as a PyTorch state dict.
We also provide the original .yaml config file used to train the model. For details on which exact parameters were used to train it, please refer to the attached config file.
IMPORTANT : Since ST Edge AI does not allow deployment of models with a dynamic input shape on ST boards, we also provide a version of the model with a static input shape, usually (1, 257, 40). With the default configuration, which includes 5 lookahead and 5 lookback frames, the model performs one inference every 30 frames.
For evaluation, use the model with dynamic input shape; for deployment, use the model with static input shape!
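One way to picture the static (1, 257, 40) shape is as a sliding window over the spectrogram: each inference sees 5 lookback frames, 30 new frames and 5 lookahead frames, and only the 30 central output frames are kept. The sketch below illustrates that idea under stated assumptions; the zero padding at the edges and the `mask_fn` callable (standing in for the static-shape model) are hypothetical and may not match the deployed application code.

```python
import numpy as np


def chunked_masks(mag, mask_fn, window=40, lookback=5, lookahead=5):
    """Run a fixed-size model over a (n_bins, n_frames) magnitude spectrogram."""
    n_bins, n_frames = mag.shape
    hop = window - lookback - lookahead  # 30 new frames per inference
    n_chunks = -(-n_frames // hop)       # ceiling division
    # Assumption: edges are zero-padded so every chunk has full context.
    padded = np.zeros(
        (n_bins, lookback + n_chunks * hop + lookahead), dtype=mag.dtype
    )
    padded[:, lookback:lookback + n_frames] = mag
    outputs = []
    for c in range(n_chunks):
        start = c * hop
        chunk = padded[None, :, start:start + window]  # (1, n_bins, 40)
        mask = mask_fn(chunk)[0]                       # (n_bins, 40)
        # Keep only the central frames; drop the lookback/lookahead context.
        outputs.append(mask[:, lookback:lookback + hop])
    return np.concatenate(outputs, axis=1)[:, :n_frames]
```

For a 100-frame spectrogram this performs 4 inferences, each on a (1, 257, 40) input.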
Recommended platforms
Platform | Supported | Recommended |
---|---|---|
STM32N6 | [x] | [x] |
Performances
Metrics
Measurements are made with the default ST Edge AI configuration, with the input / output allocated option enabled.
Reference NPU memory footprint
Model | Dataset | Format | Resolution | Series | Internal RAM | External RAM | Weights Flash | STM32Cube.AI version | STEdgeAI Core version |
---|---|---|---|---|---|---|---|---|---|
STFT-TCNN Medium | valentini | Int8 | 257x40 | STM32N6 | 100.09 | 0.0 | 1626.86 | 10.0.0 | 2.0.0 |
Reference NPU inference time
IMPORTANT NOTE : In the current version of ST Edge AI, some convolutional layers of this model are still mapped to software, which slows down inference, although there is still plenty of margin to run the model in real time. This issue will be fixed in the next ST Edge AI release.
The figures listed in the table below were measured with the affected version of ST Edge AI (i.e. the slowed-down version), so the inference times you observe should match them.
You can expect significant improvements once this issue is resolved.
Model | Dataset | Format | Resolution | Board | Execution Engine | Inference time (ms) | Inf / sec | STM32Cube.AI version | STEdgeAI Core version |
---|---|---|---|---|---|---|---|---|---|
STFT-TCNN Medium | valentini | Int8 | 257x40 | STM32N6570-DK | NPU/MCU | 53.76 | 18.6 | 10.0.0 | 2.0.0 |
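To relate the measured inference time to a real-time budget, one can compare it with the duration of audio covered by the 30 new frames of each inference. The arithmetic below is only a sketch: the hop length of 160 samples and the 16 kHz sample rate are hypothetical placeholders, not values taken from this document, and should be replaced with those in the provided .yaml config file.

```python
def realtime_factor(inference_ms, new_frames, hop_length, sample_rate):
    """Inference time divided by the duration of audio it covers (< 1 is real time)."""
    audio_ms = 1000.0 * new_frames * hop_length / sample_rate
    return inference_ms / audio_ms


inference_ms = 53.76  # from the table above
inferences_per_second = 1000.0 / inference_ms

# Assumption: hop_length and sample_rate are placeholders, see the config file.
rtf = realtime_factor(inference_ms, new_frames=30, hop_length=160, sample_rate=16000)

print(f"{inferences_per_second:.1f} inferences per second")
print(f"real-time factor: {rtf:.3f}")
```

Under these assumed values, each inference covers 300 ms of audio, so the model would use well under a fifth of the available time.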
Metrics on the Valentini dataset
We report five metrics :
- PESQ (Perceptual Evaluation of Speech Quality)
- STOI (Short-Time Objective Intelligibility)
- MSE between the clean and denoised waveforms
- SNR (Signal-to-Noise Ratio) between the clean and denoised waveforms
- Scale-invariant SNR
Model | Format | Resolution | PESQ | STOI | SNR | SI-SNR | Waveform MSE |
---|---|---|---|---|---|---|---|
STFT-TCNN Medium | float32 | 257x? | 2.480 | 0.931 | 18.190 | 18.104 | 1.136e-4 |
STFT-TCNN Medium | int8 | 257x? | 2.372 | 0.932 | 18.190 | 18.100 | 1.109e-4 |
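The three waveform-level metrics (MSE, SNR and SI-SNR) can be sketched with their usual textbook definitions, as below. This is an illustration only: the exact evaluation code used to produce the table may differ, for example in mean removal or in the epsilon terms, which are assumptions here. PESQ and STOI rely on dedicated implementations and are not reproduced.

```python
import numpy as np

EPS = 1e-12  # assumption: small constant to avoid division by zero


def waveform_mse(clean, denoised):
    """Mean squared error between two waveforms of equal length."""
    return float(np.mean((clean - denoised) ** 2))


def snr_db(clean, denoised):
    """SNR in dB, treating (clean - denoised) as the residual noise."""
    noise = clean - denoised
    return float(10 * np.log10((np.sum(clean ** 2) + EPS) / (np.sum(noise ** 2) + EPS)))


def si_snr_db(clean, denoised):
    """Scale-invariant SNR in dB, computed on zero-mean signals."""
    clean = clean - clean.mean()
    denoised = denoised - denoised.mean()
    # Project the estimate onto the clean signal to get the target component.
    target = np.dot(denoised, clean) / (np.dot(clean, clean) + EPS) * clean
    noise = denoised - target
    return float(10 * np.log10((np.sum(target ** 2) + EPS) / (np.sum(noise ** 2) + EPS)))
```

Unlike plain SNR, SI-SNR is unchanged when the denoised signal is rescaled, which makes it insensitive to gain mismatches between the model output and the reference.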
Limitations
The models provided here typically have trouble denoising speech at SNRs beyond what they were trained on. In the Valentini dataset, the lowest SNR in the training set is 0 dB. Therefore, the model tends to struggle to denoise speech at negative SNRs.
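To probe this limitation, one option is to build test mixtures at a chosen SNR, including negative values. The helper below is a hypothetical sketch of the standard power-ratio scaling; it is not part of the model zoo, and the signals used are synthetic stand-ins for real speech and noise recordings.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals `snr_db`."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise


# Synthetic stand-ins for a speech clip and a noise clip of equal length.
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noise = rng.standard_normal(16000)

noisy = mix_at_snr(speech, noise, snr_db=-5.0)  # below the 0 dB training minimum
```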
Retraining and Integration in a simple example:
Please refer to the stm32ai-modelzoo-services GitHub repository.