GGUF Models: Conversion and Upload to Hugging Face

This guide explains what GGUF models are, how to convert models to GGUF format, and how to upload them to the Hugging Face Hub.

What is GGUF?

GGUF is a binary file format for storing large language models together with their metadata, designed for fast loading and efficient inference on consumer hardware. Key features of GGUF models include:

  • Successor to the GGML format
  • Designed for efficient quantization and inference
  • Supports a wide range of model architectures
  • Commonly used with libraries like llama.cpp for running LLMs on consumer hardware
  • Stores quantized weights, which shrinks the model file at a modest cost in output quality
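
As a quick way to see the format in practice, the sketch below reads the fixed-size header at the start of a GGUF file. It assumes the version 2 or later layout (a 4-byte magic `GGUF`, then a little-endian uint32 version, a uint64 tensor count, and a uint64 metadata key/value count); the function name `read_gguf_header` is hypothetical and not part of llama.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counters


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file.

    Assumes the GGUF v2+ header layout in little-endian byte order.
    """
    with open(path, "rb") as f:
        header = f.read(HEADER_SIZE)
    if len(header) < HEADER_SIZE or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # "<IQQ": little-endian uint32 followed by two uint64 values (20 bytes)
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:HEADER_SIZE])
    return version, tensor_count, kv_count
```

Running this on a converted file is a cheap sanity check before uploading: a file that fails the magic check is not a GGUF file at all.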

Why and How to Convert to GGUF Format

Converting models to GGUF format offers several advantages:

  1. Reduced file size: GGUF models can store weights at lower precision (e.g., 4-bit or 8-bit), significantly reducing model size.
  2. Faster inference: The format is optimized for quick loading and efficient inference on CPUs and consumer GPUs.
  3. Cross-platform compatibility: GGUF models can be used with libraries like llama.cpp, enabling deployment on various platforms.
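
To make the size reduction concrete, here is a rough back-of-the-envelope estimate. It assumes the block layout used by the `q8_0` and `q4_0` types (32 weights per block plus one 16-bit scale), ignores metadata and any tensors kept at higher precision, and uses an illustrative 8-billion-parameter model, so treat the numbers as approximations rather than exact file sizes.

```python
# Approximate bits stored per weight for a few tensor types.
# q8_0 and q4_0 pack 32 weights per block with one 16-bit scale per block.
BITS_PER_WEIGHT = {
    "f32": 32.0,
    "f16": 16.0,
    "q8_0": (32 * 8 + 16) / 32,  # 8.5 bits per weight
    "q4_0": (32 * 4 + 16) / 32,  # 4.5 bits per weight
}


def estimate_size_gb(n_params, outtype):
    """Rough file size in GB (10^9 bytes), ignoring metadata overhead."""
    return n_params * BITS_PER_WEIGHT[outtype] / 8 / 1e9


if __name__ == "__main__":
    for outtype in ("f32", "f16", "q8_0", "q4_0"):
        print(f"{outtype:>5}: ~{estimate_size_gb(8e9, outtype):.1f} GB")
```

Under these assumptions an 8B model drops from about 16 GB at f16 to about 8.5 GB at q8_0.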

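To convert a model to GGUF format, we'll use the convert-hf-to-gguf.py script from the llama.cpp repository. Newer checkouts of the repository name this script convert_hf_to_gguf.py, so adjust the commands below if needed.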

Steps to Convert a Model to GGUF

  1. Clone the llama.cpp repository:

    git clone https://github.com/ggerganov/llama.cpp.git
    
  2. Install required Python libraries:

    pip install -r llama.cpp/requirements.txt
    
  3. Check that the script runs and review its options:

    python llama.cpp/convert-hf-to-gguf.py -h
    
  4. Convert the Hugging Face model to GGUF:

    python llama.cpp/convert-hf-to-gguf.py ./models/8B/Meta-Llama-3-8B-Instruct --outfile Llama3-8B-instruct-Q8.0.gguf --outtype q8_0
    

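    This command writes the model with 8-bit quantization (q8_0). The --outtype option also accepts unquantized types such as f16 and f32; run the script with -h to see the full list for your checkout. Lower-bit quantizations such as 4-bit are not produced by this script and are typically created afterwards with llama.cpp's separate quantize tool.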
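
If you convert several models, it can help to assemble the command in Python. The following is a minimal sketch: the helper name `build_convert_command`, the default script path, and the set of accepted `--outtype` values are assumptions for illustration, so check the script's `-h` output for what your checkout supports.

```python
import sys

# Assumption: the output types discussed in this guide; your checkout may accept more.
KNOWN_OUTTYPES = {"f32", "f16", "q8_0"}


def build_convert_command(model_dir, outfile, outtype="q8_0",
                          script="llama.cpp/convert-hf-to-gguf.py"):
    """Build the argument list for the llama.cpp conversion script."""
    if outtype not in KNOWN_OUTTYPES:
        raise ValueError(f"Unexpected outtype {outtype!r}; expected one of {sorted(KNOWN_OUTTYPES)}")
    # Use the current interpreter so the installed requirements are picked up
    return [sys.executable, script, model_dir, "--outfile", outfile, "--outtype", outtype]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, which raises an error if the conversion exits with a non-zero status.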

Uploading GGUF Models to Hugging Face

Once you have your GGUF model, you can upload it to Hugging Face for easy sharing and versioning.

Prerequisites

  • Python 3.8+ (recent huggingface_hub releases do not support older versions)
  • huggingface_hub library installed (pip install huggingface_hub)
  • A Hugging Face account and an access token with write permission

Upload Script

Save the following script as upload_gguf_model.py:

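import os

from huggingface_hub import HfApi


def push_to_hub(hf_token, local_path, model_id):
    """Upload a single GGUF file to a model repository on the Hugging Face Hub."""
    if not os.path.isfile(local_path):
        raise FileNotFoundError(f"GGUF file not found: {local_path}")

    api = HfApi(token=hf_token)
    # Create the repository if it does not exist yet
    api.create_repo(model_id, exist_ok=True, repo_type="model")

    # Keep the local file name in the repository instead of a hard-coded one
    api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=os.path.basename(local_path),
        repo_id=model_id,
        repo_type="model",
    )

    print(f"Model successfully pushed to {model_id}")


if __name__ == "__main__":
    # Example usage: replace the placeholder values below
    hf_token = "your_huggingface_token_here"
    local_path = "/path/to/your/model.gguf"
    model_id = "your-username/your-model-name"

    push_to_hub(hf_token, local_path, model_id)

Usage

  1. Replace the placeholder values in the script:

    • your_huggingface_token_here: Your Hugging Face access token (with write permission)
    • /path/to/your/model.gguf: The local path to the GGUF file you want to upload
    • your-username/your-model-name: Your desired model ID on Hugging Face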
  2. Run the script:

    python upload_gguf_model.py
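
A token written directly into a script is easy to leak through version control. One common alternative, sketched here, is to read it from an environment variable; the helper name `get_hf_token` and the variable name `HF_TOKEN` are assumptions for this example.

```python
import os


def get_hf_token(env_var="HF_TOKEN"):
    """Read the Hugging Face token from the environment instead of the source file."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your Hugging Face token")
    return token
```

With this in place, the script can call `get_hf_token()` where it would otherwise use the hard-coded placeholder string.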
    

Best Practices

  • Include a README.md file with your model, detailing its architecture, quantization, and usage instructions.
  • Add a config.json file with model configuration details.
  • Include any necessary tokenizer files.
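
As a starting point for the README.md, the sketch below generates a minimal model card with a YAML metadata header. The helper name `build_model_card`, the section layout, and the choice of metadata fields (`tags`, `base_model`) are assumptions for illustration; extend the text with whatever your model actually needs.

```python
def build_model_card(model_name, base_model, quantization):
    """Return README.md text with a small YAML metadata header."""
    lines = [
        "---",
        "tags:",
        "- gguf",
        f"base_model: {base_model}",
        "---",
        "",
        f"# {model_name}",
        "",
        f"GGUF conversion of {base_model}, quantization: {quantization}.",
        "",
        "## Usage",
        "",
        "Load the .gguf file with a GGUF-compatible runtime such as llama.cpp.",
    ]
    return "\n".join(lines) + "\n"
```

The returned string can be written to a local README.md and uploaded alongside the GGUF file.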

For more detailed information and updates, please refer to the official documentation of llama.cpp and Hugging Face.
