---
language:
- en
license: llama3
tags:
- Llama-3
- instruct
- finetune
- chatml
- gpt4
- synthetic data
- distillation
- function calling
- json mode
- axolotl
- roleplaying
- chat
- quantization
- AWQ
base_model: meta-llama/Llama-3.2-3B
widget:
- example_title: Hermes 3 AWQ 4-bit
  messages:
  - role: system
    content: >-
      You are a sentient, superintelligent artificial general intelligence,
      here to teach and assist me.
  - role: user
    content: >-
      Write a short story about Goku discovering Kirby has teamed up with
      Majin Buu to destroy the world.
model-index:
- name: Hermes-3-Llama-3.2-3B-AWQ-4bit
  results: []
library_name: transformers
---

# Hermes 3 - Llama-3.2 3B (AWQ 4-bit)

![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/6317aade83d8d2fd903192d9/-kj_KflXsdpcZoTQsvx7W.jpeg)

## Model Description

This is a 4-bit AWQ (Activation-aware Weight Quantization) quantized version of **Hermes 3 - Llama-3.2 3B**, a fine-tuned LLM developed by Nous Research. Quantization shrinks the memory footprint and accelerates inference while preserving most of the original model's quality, making this checkpoint suitable for low-memory devices.

For details on the original model, see the [**Hermes 3 Technical Report**](https://arxiv.org/abs/2408.11857).

## Base Model Information

Hermes 3 3B is a generalist language model fine-tuned from **Llama-3.2 3B**, with improvements in:

- Reasoning
- Roleplaying
- Function calling & structured outputs
- Multi-turn conversation
- Long-context coherence

This quantized version retains these capabilities while offering better efficiency.

## Performance Benchmarks

The original **Hermes 3 3B** model achieved strong results across a range of benchmarks. The AWQ-quantized version is expected to stay close to those numbers, though minor regressions can occur due to 4-bit quantization. For benchmark figures, refer to the original model's results.

## Prompt Format

This model follows **ChatML** formatting, similar to OpenAI's API prompt structure. Example:

```python
messages = [
    {"role": "system", "content": "You are Hermes 3."},
    {"role": "user", "content": "Hello, who are you?"},
]
# return_dict=True returns input_ids and attention_mask so the result can be
# unpacked into generate(); add_generation_prompt=True appends the assistant
# header so the model continues as the assistant.
gen_input = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
)
model.generate(**gen_input)
```

For more details, see the [Hermes 3 documentation](https://huggingface.co/NousResearch/Hermes-3-Llama-3.2-3B).

## Inference with AWQ 4-bit Model

To use this quantized model efficiently, load it with **AutoAWQ** or **transformers**:

```python
from transformers import AutoTokenizer
from awq import AutoAWQForCausalLM  # provided by the autoawq package

model_path = "your_model_path"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# from_quantized places the model on the GPU; fuse_layers enables fused kernels.
model = AutoAWQForCausalLM.from_quantized(model_path, fuse_layers=True)

prompt = "<|im_start|>user\nHello! How are you?<|im_end|>\n<|im_start|>assistant\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

output = model.generate(input_ids, max_new_tokens=256)
response = tokenizer.decode(output[0], skip_special_tokens=True)
print(response)
```
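Alternatively, recent versions of **transformers** can load AWQ checkpoints directly through `AutoModelForCausalLM` (the AWQ `quantization_config` stored in the checkpoint is picked up automatically, provided `autoawq` is installed). A minimal sketch, where `model_path` is a placeholder for wherever this checkpoint lives:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "your_model_path"  # placeholder: local dir or Hub repo id

tokenizer = AutoTokenizer.from_pretrained(model_path)
# transformers reads the AWQ quantization_config from the checkpoint and
# dispatches to the AWQ kernels, so no AutoAWQ-specific loader is needed here.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="cuda",
)

messages = [
    {"role": "system", "content": "You are Hermes 3."},
    {"role": "user", "content": "Hello, who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```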
## Quantized Model Use Cases

- Running LLMs on lower-end **consumer GPUs** (e.g., RTX 3060, 4060)
- **Faster inference** with minimal degradation in quality
- **Edge computing & on-device AI** with constrained resources
- **Cloud inference** with a better performance/cost ratio

As a rough estimate, 4-bit weights for a ~3B-parameter model occupy about 3e9 × 0.5 bytes ≈ 1.5 GB (plus quantization scales and activations), versus roughly 6 GB at FP16, which is what makes the consumer-GPU and edge scenarios above feasible.

## Limitations & Considerations

- **Minor accuracy loss** due to 4-bit quantization (slightly less precise responses in rare cases)
- **Reduced numerical precision** trades some fine-grained detail for lower computational overhead
- **Best suited for inference**, rather than fine-tuning or continued training

## Citation

If you use this model, please cite the original **Hermes 3 Technical Report**:

```bibtex
@misc{teknium2024hermes3technicalreport,
      title={Hermes 3 Technical Report},
      author={Ryan Teknium and Jeffrey Quesnelle and Chen Guang},
      year={2024},
      eprint={2408.11857},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.11857},
}
```

## Acknowledgments

This quantization was performed using the **AWQ** method for LLM optimization. The base model was developed by Nous Research, and quantization was applied to make deployment more efficient while preserving model quality.

For further details, refer to [Nous Research](https://huggingface.co/NousResearch) and the [Hermes 3 models](https://huggingface.co/NousResearch/Hermes-3-Llama-3.2-3B).
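## Appendix: Reproducing a Similar Quantization

The exact settings used for this checkpoint are not recorded here, but a typical AutoAWQ 4-bit run follows the sketch below. The output directory name and the `quant_config` values are illustrative AutoAWQ defaults, not the confirmed settings of this repo:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

base_path = "NousResearch/Hermes-3-Llama-3.2-3B"
quant_path = "Hermes-3-Llama-3.2-3B-AWQ-4bit"  # illustrative output directory

# Common AutoAWQ defaults: 4-bit weights, group size 128, GEMM kernels.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(base_path)
tokenizer = AutoTokenizer.from_pretrained(base_path)

# Runs activation-aware calibration on a small default dataset and
# rewrites the weights in place as 4-bit quantized tensors.
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```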