XCodec Mini - Neural Audio Codec

Model Description

XCodec Mini is a neural audio codec for high-quality music compression and reconstruction. It combines a semantic encoder with an acoustic encoder so that audio can be compressed efficiently while preserving perceived quality.

Key Features

  • Dual Encoding Architecture

    • Semantic encoder for high-level musical features
    • Acoustic encoder for detailed sound information
    • Multi-scale processing for efficient compression
  • Advanced Compression

    • Multiple codebooks for flexible quality/size tradeoff
    • Support for 44.1kHz high-fidelity audio
    • Separate processing paths for vocals and instrumentals
  • Technical Specifications

    • Input: Raw audio at 44.1kHz
    • Output: Compressed representations and reconstructed audio
    • Model Size: [Add total size]
    • Compression Ratio: [Add typical ratio]
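
Until the compression ratio is filled in above, it can be estimated from the token stream. The sketch below compares the bitrate of raw PCM audio with the bitrate of a multi-codebook token stream. The codebook count, codebook size, and frame rate in the example are illustrative assumptions, not published figures for XCodec Mini.

```python
import math


def pcm_bitrate(sample_rate_hz: int, bit_depth: int, channels: int) -> int:
    """Bits per second of uncompressed PCM audio."""
    return sample_rate_hz * bit_depth * channels


def token_bitrate(num_codebooks: int, codebook_size: int, frames_per_second: float) -> float:
    """Bits per second of a token stream in which every frame
    emits one index per codebook."""
    bits_per_index = math.log2(codebook_size)
    return num_codebooks * bits_per_index * frames_per_second


def compression_ratio(raw_bps: float, coded_bps: float) -> float:
    """How many times smaller the coded stream is than the raw stream."""
    return raw_bps / coded_bps


# Raw input as described in this card: 44.1 kHz, 16-bit, mono.
raw = pcm_bitrate(44100, 16, 1)

# Hypothetical codec settings (assumptions, for illustration only).
coded = token_bitrate(num_codebooks=8, codebook_size=1024, frames_per_second=50.0)

print(f"raw: {raw} bps, coded: {coded:.0f} bps, "
      f"ratio: {compression_ratio(raw, coded):.1f}x")
```

Using fewer codebooks lowers the coded bitrate proportionally, which is the quality/size tradeoff mentioned above.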

Intended Uses

  • High-quality music compression
  • Audio archival and storage
  • Music streaming applications
  • Audio processing pipelines

Training Data

The model was trained on a diverse dataset of music, including:

  • Various genres and styles
  • Vocal and instrumental tracks
  • High-quality studio recordings

Performance and Limitations

Strengths

  • High-quality audio reconstruction
  • Efficient compression ratios
  • Separate handling of vocals and instrumentals
  • Support for high sample rates

Limitations

  • Computationally intensive for real-time applications
  • Requires significant GPU memory
  • Best suited for offline processing
  • May introduce artifacts in extreme compression settings

Technical Specifications

Model Architecture

  1. Semantic Encoder

    • Based on HuBERT architecture
    • Captures high-level musical features
    • Outputs semantic tokens
  2. Acoustic Encoder

    • Multi-scale convolutional architecture
    • Processes detailed sound information
    • Generates acoustic tokens
  3. Dual Decoders

    • Separate decoders for vocals and instrumentals
    • Multi-stage reconstruction process
    • Quality-focused design

Input Requirements

  • Audio Format: WAV/MP3
  • Sample Rate: 44.1kHz
  • Channels: Mono/Stereo
  • Bit Depth: 16-bit
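
A WAV file can be checked against these requirements before it is passed to the model. The following is a minimal sketch using Python's standard `wave` module; the function name `check_wav` is hypothetical and not part of the XCodec Mini code. It covers WAV only, since MP3 input would need a separate decoder.

```python
import wave
from typing import List

EXPECTED_SAMPLE_RATE = 44100      # 44.1 kHz
EXPECTED_SAMPLE_WIDTH_BYTES = 2   # 16-bit samples
ALLOWED_CHANNELS = (1, 2)         # mono or stereo


def check_wav(path: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means
    the file matches the input requirements listed above."""
    problems = []
    with wave.open(path, "rb") as wf:
        if wf.getframerate() != EXPECTED_SAMPLE_RATE:
            problems.append(
                f"sample rate is {wf.getframerate()} Hz, expected {EXPECTED_SAMPLE_RATE} Hz"
            )
        if wf.getsampwidth() != EXPECTED_SAMPLE_WIDTH_BYTES:
            problems.append(
                f"bit depth is {8 * wf.getsampwidth()}-bit, expected 16-bit"
            )
        if wf.getnchannels() not in ALLOWED_CHANNELS:
            problems.append(
                f"{wf.getnchannels()} channels, expected mono or stereo"
            )
    return problems
```

Files that fail the check can be resampled or converted with an audio tool of your choice before processing.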

Output Format

  • Reconstructed Audio: 44.1kHz WAV
  • Intermediate Representations: Compressed tokens

Usage Guidelines

Hardware Requirements

  • GPU: NVIDIA GPU with 8GB+ VRAM
  • RAM: 16GB+ recommended
  • Storage: SSD recommended for faster processing

Software Requirements

  • Python 3.8+
  • PyTorch 2.0+
  • CUDA 11.0+
  • Additional dependencies are listed in the installation guide
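
The requirements above can be verified with a short script before running the model. This is a sketch only: the helper names are hypothetical, and the thresholds simply mirror the versions listed in this card. It reports PyTorch and CUDA as unavailable rather than failing if PyTorch is not installed.

```python
import importlib.util
import re
import sys

MIN_PYTHON = (3, 8)
MIN_TORCH = (2, 0)
MIN_CUDA = (11, 0)
MIN_VRAM_BYTES = 8 * 1024 ** 3  # 8 GB


def parse_major_minor(version: str):
    """Extract (major, minor) from strings such as '2.1.0+cu118' or '11.8'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def check_environment() -> dict:
    """Return a mapping from requirement name to True/False."""
    report = {
        "python": sys.version_info[:2] >= MIN_PYTHON,
        "torch": False,
        "cuda": False,
        "vram": False,
    }
    if importlib.util.find_spec("torch") is None:
        return report  # PyTorch is not installed

    import torch

    report["torch"] = parse_major_minor(torch.__version__) >= MIN_TORCH
    if torch.cuda.is_available() and torch.version.cuda is not None:
        report["cuda"] = parse_major_minor(torch.version.cuda) >= MIN_CUDA
        total = torch.cuda.get_device_properties(0).total_memory
        report["vram"] = total >= MIN_VRAM_BYTES
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```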

Ethical Considerations

  • Copyright: Users should ensure they have proper rights to process copyrighted material
  • Attribution: Proper attribution should be given when using this model
  • Data Privacy: Consider data privacy implications when processing sensitive audio

Additional Information

Model Weights

The model requires several checkpoint files:

  • Semantic Encoder
  • Vocal Decoder
  • Instrumental Decoder
  • Final Checkpoint
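
Because the model depends on several checkpoints, it helps to confirm that all of them are present before loading. The sketch below shows one way to do this. The file names are placeholders chosen for illustration, since this card does not list the actual names; replace them with the files shipped in the repository.

```python
from pathlib import Path
from typing import List

# Placeholder file names (assumptions); substitute the real checkpoint names.
REQUIRED_CHECKPOINTS = {
    "Semantic Encoder": "semantic_encoder.pth",
    "Vocal Decoder": "vocal_decoder.pth",
    "Instrumental Decoder": "instrumental_decoder.pth",
    "Final Checkpoint": "final.pth",
}


def missing_checkpoints(weights_dir: str) -> List[str]:
    """Return the names of required checkpoints not found in weights_dir."""
    root = Path(weights_dir)
    return [
        role
        for role, filename in REQUIRED_CHECKPOINTS.items()
        if not (root / filename).is_file()
    ]
```

An empty result means every listed file was found; otherwise the returned names indicate which checkpoints still need to be downloaded.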

Contact

For issues and questions, please use the GitHub repository's issue tracker.
