---
license: mit
tags:
- DAC
- Descript Audio Codec
- PyTorch
---

# Descript Audio Codec (DAC)

DAC is a state-of-the-art audio tokenizer that improves on earlier tokenizers such as SoundStream and EnCodec.

This model card provides an easy-to-use API for a *pretrained DAC* [1] for 16 kHz audio, whose backbone and pretrained weights come from [its original repository](https://github.com/descriptinc/descript-audio-codec). With this API, you can encode and decode audio with a single line of code on either CPU or GPU. Furthermore, it supports chunk-based processing to keep memory usage low, which is especially important on GPU.

### Model variations

There are three model variants, one per input audio sampling rate.

| Model | Input audio sampling rate [kHz] |
| ------------------ | ----------------- |
| [`hance-ai/descript-audio-codec-44khz`](https://huggingface.co/hance-ai/descript-audio-codec-44khz) | 44.1 |
| [`hance-ai/descript-audio-codec-24khz`](https://huggingface.co/hance-ai/descript-audio-codec-24khz) | 24 |
| [`hance-ai/descript-audio-codec-16khz`](https://huggingface.co/hance-ai/descript-audio-codec-16khz) | 16 |

# Dependency

See `requirements.txt`.

# Usage

### Load

```python
from transformers import AutoModel

# device setting
device = 'cpu'  # or 'cuda:0'

# load
model = AutoModel.from_pretrained('hance-ai/descript-audio-codec-16khz', trust_remote_code=True)
model.to(device)
```

### Encode

```python
audio_filename = 'path/example_audio.wav'
zq, s = model.encode(audio_filename)
```

`zq` holds the discrete embeddings with shape (1, num_RVQ_codebooks, token_length), and `s` is the token sequence with shape (1, num_RVQ_codebooks, token_length).

### Decode

```python
# decoding from `zq`
waveform = model.decode(zq=zq)  # (1, 1, audio_length); the output has a mono channel.

# decoding from `s`
waveform = model.decode(s=s)  # (1, 1, audio_length); the output has a mono channel.
```

### Save a waveform as an audio file

```python
model.waveform_to_audiofile(waveform, 'out.wav')
```

### Save and load tokens

```python
model.save_tensor(s, 'tokens.pt')
loaded_s = model.load_tensor('tokens.pt')
```

# References

[1] Kumar, Rithesh, et al. "High-fidelity audio compression with improved RVQGAN." Advances in Neural Information Processing Systems 36 (2024).
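
# Example: end-to-end round trip

The individual steps in the Usage section can be chained into one helper. The sketch below is illustrative only: `roundtrip` is a hypothetical name that is not part of this model's API, and it calls nothing beyond the methods shown above (`encode`, `save_tensor`, `load_tensor`, `decode`, `waveform_to_audiofile`). It assumes `model` was loaded as described in the Load section.

```python
def roundtrip(model, audio_path, tokens_path, out_path):
    """Encode audio to tokens, store them, reload them, decode, and write audio.

    Hypothetical helper: it only wraps the model methods documented in the
    Usage section and adds no behavior of its own.
    """
    # Encode the file; `zq` (the embeddings) is not needed for this path.
    _, s = model.encode(audio_path)

    # Persist the token sequence and read it back.
    model.save_tensor(s, tokens_path)
    loaded_s = model.load_tensor(tokens_path)

    # Decode from tokens; per the Decode section the output is
    # (1, 1, audio_length) with a single (mono) channel.
    waveform = model.decode(s=loaded_s)

    # Write the reconstructed waveform to disk.
    model.waveform_to_audiofile(waveform, out_path)
    return waveform
```

For example, `roundtrip(model, 'path/example_audio.wav', 'tokens.pt', 'out.wav')` would store the tokens in `tokens.pt` and the reconstruction in `out.wav`. Decoding from `s` rather than `zq` is chosen here because the tokens are what gets saved to disk.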