---
language:
- en
tags:
- audio
- music
- codec
- neural-audio
- audio-compression
- transformers
pipeline_tag: audio-to-audio
library_name: transformers
inference: true
---

# XCodec Mini - Neural Audio Codec

## Model Description

XCodec Mini is a neural audio codec designed for high-quality music compression and reconstruction. It combines two encoding approaches: a semantic encoder captures high-level musical features, and an acoustic encoder captures detailed sound information. Together they allow efficient compression while maintaining audio quality.

### Key Features

- **Dual Encoding Architecture**
  - Semantic encoder for high-level musical features
  - Acoustic encoder for detailed sound information
  - Multi-scale processing for efficient compression

- **Advanced Compression**
  - Multiple codebooks for flexible quality/size tradeoff
  - Support for 44.1 kHz high-fidelity audio
  - Separate processing paths for vocals and instrumentals

- **Technical Specifications**
  - Input: Raw audio at 44.1 kHz
  - Output: Compressed representations and reconstructed audio
  - Model Size: [Add total size]
  - Compression Ratio: [Add typical ratio]

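The quality/size tradeoff from using multiple codebooks can be made concrete with a little arithmetic. The sketch below is illustrative only: the frame rate, codebook count, and codebook size are assumed values chosen for the example, not published XCodec Mini figures, and the function names are hypothetical.

```python
import math


def pcm_bitrate(sample_rate_hz: int, bit_depth: int, channels: int) -> int:
    """Bits per second of uncompressed PCM audio."""
    return sample_rate_hz * bit_depth * channels


def token_bitrate(frames_per_second: float, num_codebooks: int, codebook_size: int) -> float:
    """Bits per second of a token stream.

    Each frame emits one index per codebook, and each index
    costs log2(codebook_size) bits.
    """
    return frames_per_second * num_codebooks * math.log2(codebook_size)


def compression_ratio(raw_bps: float, coded_bps: float) -> float:
    """How many times smaller the coded stream is than the raw stream."""
    return raw_bps / coded_bps


# 44.1 kHz, 16-bit stereo input (matches the input format in this card).
raw = pcm_bitrate(44_100, 16, 2)

# ASSUMED codec settings, for illustration only.
coded = token_bitrate(frames_per_second=50, num_codebooks=8, codebook_size=1024)

print(f"raw: {raw} bps, coded: {coded:.0f} bps, ratio: {compression_ratio(raw, coded):.1f}x")
```

Under these assumptions, halving `num_codebooks` halves the coded bitrate, which is the sense in which the number of codebooks trades quality against size.
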
## Intended Uses

- High-quality music compression
- Audio archival and storage
- Music streaming applications
- Audio processing pipelines

## Training Data

The model was trained on a diverse dataset of music, including:
- Various genres and styles
- Vocal and instrumental tracks
- High-quality studio recordings

## Performance and Limitations

### Strengths
- High-quality audio reconstruction
- Efficient compression ratios
- Separate handling of vocals and instrumentals
- Support for high sample rates

### Limitations
- Computationally intensive for real-time applications
- Requires significant GPU memory
- Best suited for offline processing
- May introduce artifacts in extreme compression settings

## Technical Specifications

### Model Architecture
1. **Semantic Encoder**
   - Based on HuBERT architecture
   - Captures high-level musical features
   - Outputs semantic tokens

2. **Acoustic Encoder**
   - Multi-scale convolutional architecture
   - Processes detailed sound information
   - Generates acoustic tokens

3. **Dual Decoders**
   - Separate decoders for vocals and instrumentals
   - Multi-stage reconstruction process
   - Quality-focused design

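This card does not say how the encoders turn features into tokens across multiple codebooks. One common scheme in neural codecs is residual vector quantization (RVQ), where each codebook quantizes the error left by the previous one. The sketch below assumes RVQ purely for illustration; it is not confirmed to be XCodec Mini's quantizer, and the toy codebooks and function names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]
Codebook = Sequence[Vector]


def nearest(codebook: Codebook, x: Vector) -> int:
    """Index of the codebook entry closest to x (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], x)),
    )


def rvq_encode(x: Vector, codebooks: Sequence[Codebook]) -> List[int]:
    """Quantize x stage by stage; each stage encodes the remaining residual."""
    residual = list(x)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # Subtract the chosen entry so the next stage sees only the leftover error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens: Sequence[int], codebooks: Sequence[Codebook]) -> List[float]:
    """Sum the chosen entries. Passing fewer tokens gives a coarser result."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy 2-D codebooks: a coarse stage followed by a fine correction stage.
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]],
]
tokens = rvq_encode([1.2, 0.8], toy_codebooks)
coarse = rvq_decode(tokens[:1], toy_codebooks)  # first codebook only
fine = rvq_decode(tokens, toy_codebooks)        # both codebooks
```

Decoding with only the first token yields a rougher approximation than decoding with both, which is how a multi-codebook design can offer several quality levels from one token stream.
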
### Input Requirements
- Audio Format: WAV/MP3
- Sample Rate: 44.1 kHz
- Channels: Mono/Stereo
- Bit Depth: 16-bit

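Checking a file against the input requirements above before running the model can save a failed job. The following is a minimal sketch using Python's standard `wave` module; it covers WAV files only (MP3 input would need a separate decoder), and the helper name `check_wav` is hypothetical.

```python
import wave
from typing import List

EXPECTED_RATE_HZ = 44_100      # 44.1 kHz, per the input requirements
EXPECTED_SAMPLE_WIDTH = 2      # bytes per sample, i.e. 16-bit
ALLOWED_CHANNELS = (1, 2)      # mono or stereo


def check_wav(path: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means the file conforms."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        channels = wav.getnchannels()
    if rate != EXPECTED_RATE_HZ:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE_HZ} Hz")
    if width != EXPECTED_SAMPLE_WIDTH:
        problems.append(f"bit depth is {width * 8}-bit, expected 16-bit")
    if channels not in ALLOWED_CHANNELS:
        problems.append(f"{channels} channels, expected mono or stereo")
    return problems
```

A file that fails the sample-rate or bit-depth check would need to be resampled or converted with an audio tool of your choice before use.
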
### Output Format
- Reconstructed Audio: 44.1 kHz WAV
- Intermediate Representations: Compressed tokens

## Usage Guidelines

### Hardware Requirements
- GPU: NVIDIA GPU with 8 GB+ VRAM
- RAM: 16 GB+ recommended
- Storage: SSD recommended for faster processing

### Software Requirements
- Python 3.8+
- PyTorch 2.0+
- CUDA 11.0+
- Additional dependencies listed in installation guide

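The hardware and software requirements above can be checked from Python before loading the model. This is a minimal sketch, not an official setup script: `check_environment` is a hypothetical helper, and it only reports what it finds rather than enforcing the listed minimums. It degrades gracefully when PyTorch is not installed.

```python
import importlib.util
import sys
from typing import Any, Dict


def check_environment() -> Dict[str, Any]:
    """Collect the facts relevant to the listed requirements."""
    report: Dict[str, Any] = {
        "python_ok": sys.version_info >= (3, 8),
        "torch_installed": False,
        "torch_version": None,
        "cuda_available": False,
        "cuda_version": None,
        "gpu_memory_gb": None,
    }
    # Avoid an ImportError on machines without PyTorch.
    if importlib.util.find_spec("torch") is None:
        return report

    import torch

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    report["cuda_version"] = torch.version.cuda  # None on CPU-only builds
    if torch.cuda.is_available():
        report["cuda_available"] = True
        props = torch.cuda.get_device_properties(0)
        report["gpu_memory_gb"] = round(props.total_memory / 1024**3, 1)
    return report


for key, value in check_environment().items():
    print(f"{key}: {value}")
```

Compare `torch_version`, `cuda_version`, and `gpu_memory_gb` in the output against the PyTorch 2.0+, CUDA 11.0+, and 8 GB+ VRAM figures listed above.
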
## Ethical Considerations

- **Copyright**: Users should ensure they have proper rights to process copyrighted material
- **Attribution**: Proper attribution should be given when using this model
- **Data Privacy**: Consider data privacy implications when processing sensitive audio

## Additional Information

### Model Weights
The model requires several checkpoint files:
- Semantic Encoder
- Vocal Decoder
- Instrumental Decoder
- Final Checkpoint

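Because the model depends on four separate checkpoints, it can help to confirm they are all present before loading anything. The sketch below assumes a flat directory layout, and the file names are hypothetical placeholders, since this card does not give the actual names. Substitute the real paths from the repository.

```python
from pathlib import Path
from typing import Dict, List

# HYPOTHETICAL file names: replace with the actual checkpoint paths.
EXPECTED_CHECKPOINTS: Dict[str, str] = {
    "Semantic Encoder": "semantic_encoder.pth",
    "Vocal Decoder": "vocal_decoder.pth",
    "Instrumental Decoder": "instrumental_decoder.pth",
    "Final Checkpoint": "final_checkpoint.pth",
}


def find_missing_checkpoints(
    model_dir: str, expected: Dict[str, str] = EXPECTED_CHECKPOINTS
) -> List[str]:
    """Return the component names whose checkpoint file is not in model_dir."""
    root = Path(model_dir)
    return [name for name, filename in expected.items() if not (root / filename).is_file()]
```

Calling `find_missing_checkpoints("path/to/weights")` returns an empty list when every component is in place, or the names of the components still to be downloaded.
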
### Contact
For issues and questions, please use the GitHub repository's issue tracker.