Tags: Safetensors · English · falcon_mamba · 4-bit precision · bitsandbytes
ybelkada committed · Commit f02f741 (verified) · 1 Parent(s): 17983bf

Update README.md

Files changed (1): README.md (+21, -72)
README.md CHANGED
@@ -8,6 +8,10 @@ language:
 
 <img src="https://huggingface.co/datasets/tiiuae/documentation-images/resolve/main/falcon_mamba/thumbnail.png" alt="drawing" width="800"/>
 
+ **Make sure to install bitsandbytes and have a GPU compatible with bitsandbytes to run this model.**
+ 
+ Model card for the FalconMamba Instruct model, quantized in 4-bit precision.
+ 
 # Table of Contents
 
 0. [TL;DR](#TL;DR)
@@ -39,23 +43,7 @@ Find below some example scripts on how to use the model in `transformers` (Make
 
 ### Running the model on a CPU
 
- <details>
- <summary> Click to expand </summary>
-
- ```python
- from transformers import AutoTokenizer, AutoModelForCausalLM
-
- tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b")
- model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-mamba-7b")
-
- input_text = "Question: How many hours in one day? Answer: "
- input_ids = tokenizer(input_text, return_tensors="pt").input_ids
-
- outputs = model.generate(input_ids)
- print(tokenizer.decode(outputs[0]))
- ```
-
- </details>
+ The model is quantized in 4-bit precision with `bitsandbytes`, so you can only use it with a compatible GPU.
 
 ### Running the model on a GPU
 
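The note above states that this checkpoint requires `bitsandbytes` and a compatible GPU. As a minimal sketch (not part of the commit above), a pre-flight check before running the snippets below could look like this:

```python
# Hypothetical environment check: the 4-bit checkpoint needs a CUDA-capable GPU
# and the bitsandbytes package installed.
import importlib.util

import torch

assert torch.cuda.is_available(), "A CUDA-capable GPU is required for this 4-bit checkpoint."
assert importlib.util.find_spec("bitsandbytes") is not None, "Install it with: pip install bitsandbytes"
```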
@@ -66,11 +54,14 @@ print(tokenizer.decode(outputs[0]))
 # pip install accelerate
 from transformers import AutoTokenizer, AutoModelForCausalLM
 
- tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b")
- model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-mamba-7b", device_map="auto")
+ tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b-instruct-4bit")
+ model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-mamba-7b-instruct-4bit", device_map="auto")
 
- input_text = "Question: How many hours in one day? Answer: "
- input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
+ # We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
+ messages = [
+     {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
+ ]
+ input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda")
 
 outputs = model.generate(input_ids)
 print(tokenizer.decode(outputs[0]))
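Equivalently (a sketch under the same assumptions as the added lines above, not taken from the card), the chat template can be rendered to a string first and tokenized in a second step, which makes the formatted prompt easy to inspect:

```python
from transformers import AutoTokenizer

# Same repo id as in the diff above.
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b-instruct-4bit")

messages = [
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]

# Render the chat template to a plain string, inspect it, then tokenize explicitly.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
```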
@@ -87,38 +78,16 @@ print(tokenizer.decode(outputs[0]))
 import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM
 
- tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b")
- model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-mamba-7b", torch_dtype=torch.bfloat16).to(0)
+ tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b-instruct-4bit")
+ model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-mamba-7b-instruct-4bit", torch_dtype=torch.bfloat16).to(0)
 
 model = torch.compile(model)
 
- input_text = "Question: How many hours in one day? Answer: "
- input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
-
- outputs = model.generate(input_ids)
- print(tokenizer.decode(outputs[0]))
- ```
-
- </details>
-
-
- ### Running the model on a GPU using different precisions
-
- #### FP16
-
- <details>
- <summary> Click to expand </summary>
-
- ```python
- # pip install accelerate
- import torch
- from transformers import AutoTokenizer, AutoModelForCausalLM
-
- tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b")
- model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-mamba-7b", device_map="auto", torch_dtype=torch.float16)
-
- input_text = "Question: How many hours in one day? Answer: "
- input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
+ # We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
+ messages = [
+     {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
+ ]
+ input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda")
 
 outputs = model.generate(input_ids)
 print(tokenizer.decode(outputs[0]))
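As a usage note on the `torch.compile` call kept in the hunk above: the first generation is slow because it triggers compilation, while later calls reuse the compiled graph. A minimal timing sketch (assuming the `model` and `input_ids` names from that snippet; `max_new_tokens=30` is an arbitrary illustrative choice):

```python
import time

# First iteration includes compilation overhead; the second reflects steady-state speed.
for label in ("warm-up (compiling)", "compiled"):
    start = time.perf_counter()
    _ = model.generate(input_ids, max_new_tokens=30)
    print(f"{label}: {time.perf_counter() - start:.1f} s")
```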
@@ -126,28 +95,6 @@ print(tokenizer.decode(outputs[0]))
 
 </details>
 
- #### 4-bit
-
- <details>
- <summary> Click to expand </summary>
-
- ```python
- # pip install bitsandbytes accelerate
- from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
-
- tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b")
- model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-mamba-7b", device_map="auto", quantization_config=BitsAndBytesConfig(load_in_4bit=True))
-
- input_text = "Question: How many hours in one day? Answer: "
- input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
-
- outputs = model.generate(input_ids)
- print(tokenizer.decode(outputs[0]))
- ```
-
- </details>
-
- <br>
 
 # Training Details
 
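The hunk above drops the old `#### 4-bit` section, since this repository already ships 4-bit weights. For comparison, a minimal sketch of the on-the-fly route that the removed lines used (quantizing the base `tiiuae/falcon-mamba-7b` checkpoint at load time with `BitsAndBytesConfig`):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load the full-precision base model and quantize it to 4-bit at load time,
# as the removed section did; this repo instead stores 4-bit weights directly.
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b")
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-mamba-7b",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)
```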
@@ -164,6 +111,8 @@ In particular, we used samples coming from [Fineweb-edu](https://huggingface.co/
 
 The data was tokenized with the Falcon-[7B](https://huggingface.co/tiiuae/falcon-7B)/[11B](https://huggingface.co/tiiuae/falcon-11B) tokenizer.
 
+ After pre-training, the model was further fine-tuned on instruction data.
+ 
 ## Training Procedure
 Falcon-Mamba-7B was trained on 256 H100 80GB GPUs for the majority of the training, using a 3D parallelism strategy (TP=1, PP=1, DP=256) combined with ZeRO.
 
 