Update README.md
README.md
### Inference
For inference, you can follow our [Audio Codec Inference Tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/Audio_Codec_Inference.ipynb), which downloads the model checkpoint automatically. Note that you need to set the `model_name` parameter to `"nvidia/low-frame-rate-speech-codec-22khz"`.

You can also use the code below, which likewise downloads the checkpoint automatically:

```
import librosa
import torch
import soundfile as sf
from nemo.collections.tts.models import AudioCodecModel

path_to_input_audio = "path/to/input.wav"  # path of the input audio
path_to_output_audio = "path/to/output.wav"  # path of the reconstructed output audio

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# load audio codec (the checkpoint is downloaded automatically) and move it to the same device as the inputs
nemo_codec_model = AudioCodecModel.from_pretrained("nvidia/low-frame-rate-speech-codec-22khz").to(device).eval()

# get discrete tokens from audio
audio, _ = librosa.load(path_to_input_audio, sr=nemo_codec_model.sample_rate)

audio_tensor = torch.from_numpy(audio).unsqueeze(dim=0).to(device)
audio_len = torch.tensor([audio_tensor[0].shape[0]]).to(device)

# disable gradient tracking so the decoded tensor can be converted to NumPy
with torch.no_grad():
    encoded_tokens, encoded_len = nemo_codec_model.encode(audio=audio_tensor, audio_len=audio_len)

    # Reconstruct audio from tokens
    reconstructed_audio, _ = nemo_codec_model.decode(tokens=encoded_tokens, tokens_len=encoded_len)

# save reconstructed audio
output_audio = reconstructed_audio.cpu().numpy().squeeze()
sf.write(path_to_output_audio, output_audio, nemo_codec_model.sample_rate)
```
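
The example above encodes a single clip as a `(1, T)` batch with a one-element `audio_len`. To encode several clips in one call, you would zero-pad them to a common length and pass their true lengths in `audio_len`. The following is a minimal sketch that assumes `encode` uses `audio_len` to ignore the padding, as its signature suggests. `pad_batch` and `paths` are hypothetical names and are not part of NeMo:

```python
import numpy as np


def pad_batch(clips):
    """Zero-pad 1-D audio clips into a (B, T_max) float32 batch.

    Hypothetical helper (not part of NeMo). Returns the padded batch and an
    int64 array with each clip's original number of samples, i.e. the values
    you would pass as `audio` and `audio_len` to `encode`.
    """
    if not clips:
        raise ValueError("need at least one clip")
    lengths = np.array([len(clip) for clip in clips], dtype=np.int64)
    batch = np.zeros((len(clips), int(lengths.max())), dtype=np.float32)
    for i, clip in enumerate(clips):
        batch[i, : len(clip)] = clip  # real samples first, zeros after
    return batch, lengths


# Usage with the objects from the example above (assumed to be in scope);
# `paths` is a hypothetical list of input audio files:
# clips = [librosa.load(p, sr=nemo_codec_model.sample_rate)[0] for p in paths]
# batch, lengths = pad_batch(clips)
# audio_tensor = torch.from_numpy(batch).to(device)
# audio_len = torch.from_numpy(lengths).to(device)
# with torch.no_grad():
#     tokens, tokens_len = nemo_codec_model.encode(audio=audio_tensor, audio_len=audio_len)
```

`decode` presumably also returns per-item output lengths (discarded as `_` above). If so, you would trim each reconstructed row to its length before saving it.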

Alternatively, you can manually download the [checkpoint](https://huggingface.co/nvidia/low-frame-rate-speech-codec-22khz/resolve/main/low-frame-rate-speech-codec-22khz.nemo) and use the code below to run inference with the model:

```
# ...
nemo_codec_model = AudioCodecModel.restore_from(restore_path=codec_path, map_location="cpu").eval()
# get discrete tokens from audio
audio, _ = librosa.load(path_to_input_audio, sr=nemo_codec_model.sample_rate)
|