Commit f3ea748 by ArthurZ (HF staff) · 1 parent: ac43cb6

Update README.md

  1. README.md +169 -1
README.md CHANGED
@@ -1,3 +1,171 @@
1
  ---
2
- duplicated_from: Matthijs/encodec_24khz
 
 
3
  ---
1
  ---
2
+ # For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
3
+ # Doc / guide: https://huggingface.co/docs/hub/model-cards
4
+ {}
5
  ---
6
+
7
+ ![encodec image](https://github.com/facebookresearch/encodec/raw/2d29d9353c2ff0ab1aeadc6a3d439854ee77da3e/architecture.png)
8
+ # Model Card for EnCodec
9
+
10
+ This model card provides details and information about EnCodec, a state-of-the-art real-time audio codec developed by Meta AI.
11
+
12
+ ## Model Details
13
+
14
+ ### Model Description
15
+
16
+ EnCodec is a high-fidelity audio codec leveraging neural networks. It introduces a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion.
17
+ The model simplifies and speeds up training using a single multiscale spectrogram adversary that efficiently reduces artifacts and produces high-quality samples.
18
+ It also includes a novel loss balancer mechanism that stabilizes training by decoupling the choice of hyperparameters from the typical scale of the loss.
19
+ Additionally, lightweight Transformer models are used to further compress the obtained representation while maintaining real-time performance.
20
+
21
+ - **Developed by:** Meta AI
22
+ - **Model type:** Audio Codec
23
+
24
+ ### Model Sources
25
+
26
+ - **Repository:** [GitHub Repository](https://github.com/facebookresearch/encodec)
27
+ - **Paper:** [EnCodec: End-to-End Neural Audio Codec](https://arxiv.org/abs/2210.13438)
28
+
29
+ ## Uses
30
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
31
+
32
+ ### Direct Use
33
+
34
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
35
+
36
+ EnCodec can be used directly as an audio codec for real-time compression and decompression of audio signals.
37
+ It provides high-quality audio compression and efficient decoding. The model was trained on various bandwidths, which can be specified when encoding (compressing) and decoding (decompressing).
38
+ Two different setups exist for EnCodec:
39
+
40
+ - Non-streamable: the input audio is split into chunks of 1 second, with an overlap of 10 ms, which are then encoded.
41
+ - Streamable: weight normalization is used on the convolution layers, and the input is not split into chunks but rather padded on the left.
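
The non-streamable chunking described above can be sketched as follows. This is a minimal illustration rather than the library's implementation: it assumes a 24 kHz sampling rate and that each chunk starts `chunk_length - overlap` samples after the previous one. The function name `chunk_bounds` is hypothetical and is not part of the EnCodec or Transformers API.

```python
def chunk_bounds(num_samples, sampling_rate=24_000, chunk_seconds=1.0, overlap_seconds=0.010):
    """Return (start, end) sample indices of overlapping chunks covering the signal."""
    chunk_length = int(round(chunk_seconds * sampling_rate))  # samples per chunk
    overlap = int(round(overlap_seconds * sampling_rate))     # samples shared by neighbours
    stride = chunk_length - overlap                           # hop between chunk starts (assumed)

    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk_length, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last chunk reaches the end of the signal
        start += stride
    return bounds


# 2.5 seconds of audio at 24 kHz
bounds = chunk_bounds(60_000)
```

Each chunk after the first begins 10 ms before the previous one ends, so consecutive chunks share a short region.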
42
+
43
+ ### Downstream Use
44
+
45
+ EnCodec can be fine-tuned for specific audio tasks or integrated into larger audio processing pipelines for applications such as speech generation,
46
+ music generation, or text-to-speech tasks.
47
+
48
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
+
50
+ [More Information Needed]
51
+
52
+ ## How to Get Started with the Model
53
+
54
+ Use the following code to get started with the EnCodec model using a dummy example from the LibriSpeech dataset (~9MB). First, install the required Python packages:
55
+
56
+ ```
57
+ pip install --upgrade pip
58
+ pip install --upgrade transformers "datasets[audio]"
59
+ ```
60
+
61
+ Then load an audio sample, and run a forward pass of the model:
62
+
63
+ ```python
64
+ from datasets import load_dataset, Audio
65
+ from transformers import EncodecModel, AutoProcessor
66
+
67
+
68
+ # load a demonstration dataset
69
+ librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
70
+
71
+ # load the model + processor (for pre-processing the audio)
72
+ model = EncodecModel.from_pretrained("facebook/encodec_24khz")
73
+ processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")
74
+
75
+ # cast the audio data to the correct sampling rate for the model
76
+ librispeech_dummy = librispeech_dummy.cast_column("audio", Audio(sampling_rate=processor.sampling_rate))
77
+ audio_sample = librispeech_dummy[0]["audio"]["array"]
78
+
79
+ # pre-process the inputs
80
+ inputs = processor(raw_audio=audio_sample, sampling_rate=processor.sampling_rate, return_tensors="pt")
81
+
82
+ # explicitly encode then decode the audio inputs
83
+ encoder_outputs = model.encode(inputs["input_values"], inputs["padding_mask"])
84
+ audio_values = model.decode(encoder_outputs.audio_codes, encoder_outputs.audio_scales, inputs["padding_mask"])[0]
85
+
86
+ # or the equivalent with a forward pass
87
+ audio_values = model(inputs["input_values"], inputs["padding_mask"]).audio_values
88
+ ```
89
+
90
+ ## Training Details
91
+
92
+ The model was trained for 300 epochs, where one epoch is 2,000 updates of the Adam optimizer, with a batch size of 64 examples of 1 second each, a learning rate of 3 · 10⁻⁴, β1 = 0.5, and β2 = 0.9. All models were trained on 8 A100 GPUs.
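
To make the scale of this schedule concrete, the hyperparameters above can be collected in a small configuration object. This is only a sketch: the class and field names are hypothetical, and the derived figures follow purely from the numbers stated in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSchedule:
    """Hyperparameters as stated in the model card (names are illustrative)."""
    epochs: int = 300
    updates_per_epoch: int = 2_000
    batch_size: int = 64
    example_seconds: float = 1.0
    learning_rate: float = 3e-4
    adam_betas: tuple = (0.5, 0.9)
    num_gpus: int = 8

    @property
    def total_updates(self) -> int:
        # 300 epochs x 2,000 updates per epoch
        return self.epochs * self.updates_per_epoch

    @property
    def audio_hours_seen(self) -> float:
        # every update consumes batch_size examples of example_seconds each
        seconds = self.total_updates * self.batch_size * self.example_seconds
        return seconds / 3600.0


schedule = TrainingSchedule()
print(schedule.total_updates)
print(round(schedule.audio_hours_seen, 1))
```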
94
+
95
+ ### Training Data
96
+
97
+
98
+ <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
99
+
100
+ - For speech:
+   - DNS Challenge 4
+   - [Common Voice](https://huggingface.co/datasets/common_voice)
+ - For general audio:
+   - [AudioSet](https://huggingface.co/datasets/Fhrozen/AudioSet2K22)
+   - [FSD50K](https://huggingface.co/datasets/Fhrozen/FSD50k)
+ - For music:
+   - [Jamendo dataset](https://huggingface.co/datasets/rkstgr/mtg-jamendo)
108
+
109
+
110
+ The authors used four different strategies to sample from these datasets:
111
+
112
+ - (s1) sample a single source from Jamendo with probability 0.32;
113
+ - (s2) sample a single source from the other datasets with the same probability;
114
+ - (s3) mix two sources from all datasets with a probability of 0.24;
115
+ - (s4) mix three sources from all datasets except music with a probability of 0.12.
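
The four probabilities above sum to 1, so choosing a strategy for each training example amounts to a weighted draw. The sketch below illustrates that draw only; the strategy labels and the `pick_strategy` helper are hypothetical names, and loading or mixing the audio itself is left out.

```python
import random

# Probabilities as listed above; the keys are illustrative labels.
STRATEGIES = {
    "s1_single_jamendo": 0.32,
    "s2_single_other": 0.32,
    "s3_mix_two_all": 0.24,
    "s4_mix_three_non_music": 0.12,
}


def pick_strategy(rng):
    """Draw one strategy name according to its probability."""
    names = list(STRATEGIES)
    weights = list(STRATEGIES.values())
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so the run is repeatable
counts = {name: 0 for name in STRATEGIES}
for _ in range(10_000):
    counts[pick_strategy(rng)] += 1

for name, count in counts.items():
    print(f"{name}: {count / 10_000:.3f}")
```

Over many draws the empirical frequencies should approach the listed probabilities.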
116
+
117
+ The audio is normalized by file and a random gain between -10 and 6 dB is applied.
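
A rough sketch of this augmentation step is shown below. The model card does not say how the normalization is done or how the gain is drawn, so the code assumes peak normalization, a gain drawn uniformly in dB, and the usual amplitude conversion `10 ** (dB / 20)`. The function name is hypothetical.

```python
import random


def normalize_and_gain(samples, rng, min_db=-10.0, max_db=6.0):
    """Peak-normalize a waveform (assumption), then apply a random gain in dB."""
    gain_db = rng.uniform(min_db, max_db)      # assumed uniform in dB
    scale = 10.0 ** (gain_db / 20.0)           # dB to linear amplitude factor
    peak = max((abs(x) for x in samples), default=0.0)
    if peak == 0.0:
        return list(samples), gain_db          # silent input: nothing to scale
    return [x / peak * scale for x in samples], gain_db


rng = random.Random(0)
augmented, gain_db = normalize_and_gain([0.1, -0.5, 0.25], rng)
```

After this step the peak of the waveform equals the linear gain factor, so a gain of 0 dB would leave a peak of exactly 1.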
118
+
119
+ ## Evaluation
120
+
121
+ <!-- This section describes the evaluation protocols and provides the results. -->
122
+
123
+ ### Subjective metric for restoration:
124
+
125
+ This model was evaluated using the MUSHRA protocol (Series, 2014), with both a hidden reference and a low anchor. Annotators were recruited through a
+ crowd-sourcing platform and asked to rate the perceptual quality of the provided samples on
+ a scale from 1 to 100. The authors randomly selected 50 samples of 5 seconds from each category of the test set
+ and required at least 10 annotations per sample. To filter noisy annotations and outliers, they removed annotators
+ who rated the reference recordings below 90 in at least 20% of the cases, or rated the low-anchor recording
+ above 80 more than 50% of the time.
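
The annotator filter described above can be written as a small predicate. This is a sketch under the assumption that each annotator's ratings of the hidden reference and of the low anchor are available as two lists; the function name and data layout are hypothetical.

```python
def keep_annotator(reference_scores, anchor_scores):
    """Return True if the annotator passes both filtering rules.

    An annotator is removed when they rate the reference below 90 in at
    least 20% of cases, or rate the low anchor above 80 more than 50% of
    the time.
    """
    if not reference_scores or not anchor_scores:
        raise ValueError("both score lists must be non-empty")

    low_reference_rate = sum(s < 90 for s in reference_scores) / len(reference_scores)
    high_anchor_rate = sum(s > 80 for s in anchor_scores) / len(anchor_scores)

    return low_reference_rate < 0.20 and high_anchor_rate <= 0.50


careful = keep_annotator([100, 95, 92, 98, 90], [20, 30, 10, 40, 25])
careless = keep_annotator([100, 60, 95, 97, 99], [20, 30, 10, 40, 25])
```

Here `careless` rates one of five references below 90, which is exactly 20% of cases and therefore meets the removal threshold.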
131
+
132
+ ### Objective metric for restoration:
133
+ The ViSQOL metric was used together with the Scale-Invariant Signal-to-Noise Ratio (SI-SNR) (Luo & Mesgarani, 2019;
134
+ Nachmani et al., 2020; Chazan et al., 2021).
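
For readers unfamiliar with SI-SNR, the sketch below follows a commonly used definition: both signals are made zero-mean, the estimate is projected onto the reference, and the energy of that projection is compared with the energy of the residual. This model card does not spell out the formula, so treat this as an assumed definition and not as the exact evaluation code used by the authors.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB (common definition, assumed)."""
    if len(estimate) != len(reference) or not reference:
        raise ValueError("signals must be non-empty and of equal length")

    n = len(reference)
    mean_est = sum(estimate) / n
    mean_ref = sum(reference) / n
    est = [x - mean_est for x in estimate]      # zero-mean estimate
    ref = [x - mean_ref for x in reference]     # zero-mean reference

    # Project the estimate onto the reference to get the "target" component.
    dot = sum(a * b for a, b in zip(est, ref))
    ref_energy = sum(b * b for b in ref)
    scale = dot / (ref_energy + eps)
    target = [scale * b for b in ref]
    noise = [a - t for a, t in zip(est, target)]

    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


# A sine reference plus a smaller, orthogonal cosine acting as distortion.
reference = [math.sin(2 * math.pi * k / 16) for k in range(64)]
estimate = [r + 0.1 * math.cos(2 * math.pi * 3 * k / 16) for k, r in enumerate(reference)]
score = si_snr(estimate, reference)
```

Because of the projection step, rescaling the estimate by a constant factor leaves the score essentially unchanged, which is what "scale-invariant" refers to.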
135
+
136
+ ### Results
137
+
138
+ The results of the evaluation demonstrate the superiority of EnCodec compared to the baselines across different bandwidths (1.5, 3, 6, and 12 kbps).
139
+
140
+ When comparing EnCodec with the baselines at the same bandwidth, EnCodec consistently outperforms them in terms of MUSHRA score.
141
+ Notably, EnCodec achieves better performance, on average, at 3 kbps compared to Lyra-v2 at 6 kbps and Opus at 12 kbps.
142
+ Additionally, by incorporating the language model over the codes, it is possible to achieve a bandwidth reduction of approximately 25-40%.
143
+ For example, the bandwidth of the 3 kbps model can be reduced to 1.9 kbps.
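
The arithmetic behind this example is easy to check. The helpers below are hypothetical names used only for illustration; the numbers are the ones quoted above.

```python
def reduced_bandwidth(kbps, reduction):
    """Bandwidth after removing a fraction `reduction` of the original."""
    return kbps * (1.0 - reduction)


def reduction_fraction(original_kbps, reduced_kbps):
    """Fraction of bandwidth saved when going from original to reduced."""
    return 1.0 - reduced_kbps / original_kbps


# The 3 kbps -> 1.9 kbps example corresponds to a saving of roughly 37%.
example_reduction = reduction_fraction(3.0, 1.9)

# A 25-40% reduction of the 3 kbps model spans this range of bandwidths.
low = reduced_bandwidth(3.0, 0.40)
high = reduced_bandwidth(3.0, 0.25)
print(f"{example_reduction:.1%} saved; range {low:.2f} to {high:.2f} kbps")
```

The quoted 1.9 kbps figure therefore sits inside the 25-40% range, near its upper end.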
144
+
145
+
146
+ #### Summary
147
+
148
+ EnCodec is a state-of-the-art real-time neural audio compression model that excels in producing high-fidelity audio samples at various sample rates and bandwidths.
149
+ The model's performance was evaluated across different settings, ranging from 24kHz monophonic at 1.5 kbps to 48kHz stereophonic, showcasing both subjective and
150
+ objective results. Notably, EnCodec incorporates a novel spectrogram-only adversarial loss, effectively reducing artifacts and enhancing sample quality.
151
+ Training stability and interpretability were further enhanced through the introduction of a gradient balancer for the loss weights.
152
+ Additionally, the study demonstrated that a compact Transformer model can be employed to achieve an additional bandwidth reduction of up to 40% without compromising
153
+ quality, particularly in applications where low latency is not critical (e.g., music streaming).
154
+
155
+
156
+ ## Citation
157
+
158
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
159
+
160
+ **BibTeX:**
161
+
162
+ ```
163
+ @misc{défossez2022high,
164
+ title={High Fidelity Neural Audio Compression},
165
+ author={Alexandre Défossez and Jade Copet and Gabriel Synnaeve and Yossi Adi},
166
+ year={2022},
167
+ eprint={2210.13438},
168
+ archivePrefix={arXiv},
169
+ primaryClass={eess.AS}
170
+ }
171
+ ```