---
license: cc-by-4.0
tags:
- audio-to-audio
pipeline_tag: audio-to-audio
---

## Paper

LLaSA: Scaling Train Time and Test Time Compute for LLaMA based Speech Synthesis (Coming soon)

Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model (AAAI 2025, xcodec 1.0)

# Getting Started with XCodec2 on Hugging Face

XCodec2 is a speech tokenizer that offers the following key features:

1. **Single Vector Quantization**
2. **50 Tokens per Second**
3. **Multilingual Speech Semantic Support and High-Quality Speech Reconstruction**

To use `xcodec2`, install it first with the following commands:

```bash
conda create -n xcodec2 python=3.9
conda activate xcodec2
# Version 0.1.3 fixes a bug in the previous version and gives better sound quality.
pip install xcodec2==0.1.3
```

Then,

```python
import torch
import soundfile as sf
from xcodec2.modeling_xcodec2 import XCodec2Model

model_path = "HKUST-Audio/xcodec2"

# Load the pretrained codec and move it to the GPU in inference mode.
model = XCodec2Model.from_pretrained(model_path)
model.eval().cuda()

# The model only supports 16 kHz speech, so test.wav must be sampled at 16 kHz.
wav, sr = sf.read("test.wav")
wav_tensor = torch.from_numpy(wav).float().unsqueeze(0)  # Shape: (1, T)

with torch.no_grad():
    # Only a single input is supported here. For batch inference, see the link below.
    vq_code = model.encode_code(input_waveform=wav_tensor)
    print("Code:", vq_code)

    recon_wav = model.decode_code(vq_code).cpu()  # Shape: (1, 1, T')

sf.write("reconstructed.wav", recon_wav[0, 0, :].numpy(), sr)
print("Done! Check reconstructed.wav")
```

# To train your own xcodec2, run batch inference, or extract codes at large scale, use the code released [here](https://github.com/zhenye234/X-Codec-2.0).
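The quick-start example assumes a single 16 kHz input and the feature list states a rate of 50 tokens per second. The sketch below shows one way to check those assumptions before encoding. It is not part of the `xcodec2` package: the helper names `prepare_waveform` and `estimate_num_tokens` are hypothetical, and the token count is only an estimate (320 samples per token at 16 kHz, rounded up), since the exact length depends on how the model pads its input.

```python
import numpy as np

TARGET_SR = 16_000       # sample rate the quick-start example requires
TOKENS_PER_SECOND = 50   # token rate stated in the feature list


def prepare_waveform(wav: np.ndarray, sr: int) -> np.ndarray:
    """Hypothetical helper: return a mono float32 array of shape (1, T).

    Raises ValueError if the audio is not sampled at 16 kHz.
    """
    if sr != TARGET_SR:
        raise ValueError(f"expected {TARGET_SR} Hz audio, got {sr} Hz; resample first")
    wav = np.asarray(wav, dtype=np.float32)
    if wav.ndim == 2:
        # soundfile returns (frames, channels) for multi-channel files;
        # average the channels to get a mono signal.
        wav = wav.mean(axis=1)
    elif wav.ndim != 1:
        raise ValueError(f"unexpected waveform shape {wav.shape}")
    return wav[np.newaxis, :]


def estimate_num_tokens(num_samples: int, sr: int = TARGET_SR) -> int:
    """Hypothetical helper: approximate token count at 50 tokens per second.

    Assumption: the count is rounded up; the model's own padding may differ.
    """
    samples_per_token = sr // TOKENS_PER_SECOND  # 320 samples at 16 kHz
    return -(-num_samples // samples_per_token)  # ceiling division
```

With these helpers, `torch.from_numpy(prepare_waveform(wav, sr))` could stand in for the manual `float().unsqueeze(0)` step, and `estimate_num_tokens(len(wav))` gives a rough idea of how many codes to expect for a clip.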