|
--- |
|
pipeline_tag: text-generation |
|
inference: false |
|
license: apache-2.0 |
|
library_name: transformers |
|
tags: |
|
- language |
|
- granite-3.1 |
|
- llama-cpp |
|
- gguf-my-repo |
|
base_model: ibm-granite/granite-3.1-2b-instruct |
|
--- |
|
|
|
# Triangle104/granite-3.1-2b-instruct-Q4_K_M-GGUF |
|
This model was converted to GGUF format from [`ibm-granite/granite-3.1-2b-instruct`](https://huggingface.co/ibm-granite/granite-3.1-2b-instruct) using llama.cpp via ggml.ai's [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space.
|
Refer to the [original model card](https://huggingface.co/ibm-granite/granite-3.1-2b-instruct) for more details on the model. |
|
|
|
--- |
|
Model details:

Granite-3.1-2B-Instruct is a 2B parameter long-context instruct model finetuned from Granite-3.1-2B-Base using a combination of permissively licensed open-source instruction datasets and internally collected synthetic datasets tailored for solving long-context problems. The model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging.
|
|
|
- Developers: Granite Team, IBM
- GitHub Repository: ibm-granite/granite-3.1-language-models
- Website: Granite Docs
- Paper: Granite 3.1 Language Models (coming soon)
- Release Date: December 18th, 2024
- License: Apache 2.0
|
|
|
|
|
Supported Languages:

English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 3.1 models for languages beyond these 12 languages.
|
|
|
|
|
Intended Use:

The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications.
|
|
|
|
|
Capabilities:

- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG) (see the sketch after the Generation example below)
- Code related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Long-context tasks including long document/meeting summarization, long document QA, etc.
|
|
|
|
|
Generation:

This is a simple example of how to use the Granite-3.1-2B-Instruct model.

Install the following libraries:

```bash
pip install torch torchvision torchaudio
pip install accelerate
pip install transformers
```

Then, copy the snippet from the section that is relevant for your use case.
|
|
|
|
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ibm-granite/granite-3.1-2b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
model.eval()
# change input text as desired
chat = [
    { "role": "user", "content": "Please list one IBM Research laboratory located in the United States. You should only output its name and location." },
]
chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
# tokenize the text and move it to the same device as the model
input_tokens = tokenizer(chat, return_tensors="pt").to(model.device)
# generate output tokens
output = model.generate(**input_tokens, max_new_tokens=100)
# decode output tokens into text
output = tokenizer.batch_decode(output)
# print output
print(output)
```
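
The same flow can be extended to the RAG capability listed above by passing retrieved passages through the chat template. The snippet below is a minimal, hedged sketch: the `documents` argument to `apply_chat_template` and the title/text document fields are assumptions based on recent transformers releases and the Granite documentation, not something shown in the original card, so verify the exact schema before relying on it.

```python
# Hedged sketch (not from the original card): RAG-style grounded generation.
# The `documents` argument and the {"title", "text"} fields are assumptions --
# check the Granite docs / original model card for the exact schema.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ibm-granite/granite-3.1-2b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
model.eval()

# Retrieved passages the answer should be grounded in (illustrative content).
documents = [
    {"title": "IBM Research", "text": "IBM Research operates laboratories in "
     "Yorktown Heights, New York and Cambridge, Massachusetts, among others."},
]
chat = [
    {"role": "user", "content": "Based on the documents, name one IBM Research laboratory in the United States."},
]
prompt = tokenizer.apply_chat_template(
    chat, documents=documents, tokenize=False, add_generation_prompt=True
)
input_tokens = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**input_tokens, max_new_tokens=100)
# decode only the newly generated tokens
print(tokenizer.decode(output[0][input_tokens["input_ids"].shape[1]:], skip_special_tokens=True))
```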
|
|
|
|
|
|
|
Model Architecture:

Granite-3.1-2B-Instruct is based on a decoder-only dense transformer architecture. Core components of this architecture are: GQA and RoPE, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.
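
As a quick way to see these components reflected in the checkpoint, the hedged sketch below reads the Hugging Face config. The field names (`num_key_value_heads` for GQA, `rope_theta` for RoPE, and so on) follow the usual decoder-only layout and are assumptions to verify against the model's `config.json`:

```python
# Illustrative sketch (not from the original card): inspect architecture
# fields in the model config. Field names are assumptions based on the
# common LLaMA-style config layout.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("ibm-granite/granite-3.1-2b-instruct")

print("attention heads: ", config.num_attention_heads)
print("key/value heads: ", getattr(config, "num_key_value_heads", None))  # fewer than heads => GQA
print("rope theta:      ", getattr(config, "rope_theta", None))
print("activation:      ", getattr(config, "hidden_act", None))           # SwiGLU-style MLP
print("norm epsilon:    ", getattr(config, "rms_norm_eps", None))         # RMSNorm
print("tied embeddings: ", getattr(config, "tie_word_embeddings", None))  # shared input/output embeddings
print("context length:  ", getattr(config, "max_position_embeddings", None))
```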
|
|
|
--- |
|
## Use with llama.cpp |
|
Install llama.cpp through brew (works on Mac and Linux) |
|
|
|
```bash
brew install llama.cpp
```
|
Invoke the llama.cpp server or the CLI. |
|
|
|
### CLI: |
|
```bash |
|
llama-cli --hf-repo Triangle104/granite-3.1-2b-instruct-Q4_K_M-GGUF --hf-file granite-3.1-2b-instruct-q4_k_m.gguf -p "The meaning to life and the universe is" |
|
``` |
|
|
|
### Server: |
|
```bash |
|
llama-server --hf-repo Triangle104/granite-3.1-2b-instruct-Q4_K_M-GGUF --hf-file granite-3.1-2b-instruct-q4_k_m.gguf -c 2048 |
|
``` |
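
Once the server is running (it listens on http://localhost:8080 by default), it exposes an OpenAI-compatible chat completions endpoint. The snippet below is a minimal sketch assuming the default host and port and the third-party `requests` package (`pip install requests`):

```python
# Minimal sketch: query a running llama-server via its OpenAI-compatible
# chat completions endpoint. Host and port are assumptions (llama-server
# defaults to http://localhost:8080).
import requests

response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "Please list one IBM Research laboratory located in the United States."}
        ],
        "max_tokens": 100,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```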
|
|
|
Note: You can also use this checkpoint directly through the [usage steps](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#usage) listed in the llama.cpp repo.
|
|
|
Step 1: Clone llama.cpp from GitHub. |
|
``` |
|
git clone https://github.com/ggerganov/llama.cpp |
|
``` |
|
|
|
Step 2: Move into the llama.cpp folder and build it with the `LLAMA_CURL=1` flag, along with any other hardware-specific flags (e.g. `LLAMA_CUDA=1` for Nvidia GPUs on Linux).
|
``` |
|
cd llama.cpp && LLAMA_CURL=1 make |
|
``` |
|
|
|
Step 3: Run inference through the main binary. |
|
``` |
|
./llama-cli --hf-repo Triangle104/granite-3.1-2b-instruct-Q4_K_M-GGUF --hf-file granite-3.1-2b-instruct-q4_k_m.gguf -p "The meaning to life and the universe is" |
|
``` |
|
or |
|
``` |
|
./llama-server --hf-repo Triangle104/granite-3.1-2b-instruct-Q4_K_M-GGUF --hf-file granite-3.1-2b-instruct-q4_k_m.gguf -c 2048 |
|
``` |
|
|