Tags: Safetensors · English · falcon_mamba · 4-bit precision · bitsandbytes
ybelkada committed · Commit f02f741 (verified) · 1 Parent(s): 17983bf

Update README.md

Files changed (1): README.md (+21, -72)
README.md CHANGED
@@ -8,6 +8,10 @@ language:
 
 <img src="https://huggingface.co/datasets/tiiuae/documentation-images/resolve/main/falcon_mamba/thumbnail.png" alt="drawing" width="800"/>
 
+ **Make sure to install bitsandbytes and have a GPU compatible with bitsandbytes to run this model.**
+ 
+ Model card for the FalconMamba Instruct model, quantized in 4-bit precision.
+ 
 # Table of Contents
 
 0. [TL;DR](#TL;DR)
@@ -39,23 +43,7 @@ Find below some example scripts on how to use the model in `transformers` (Make
 
 ### Running the model on a CPU
 
- <details>
- <summary> Click to expand </summary>
-
- ```python
- from transformers import AutoTokenizer, AutoModelForCausalLM
-
- tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b")
- model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-mamba-7b")
-
- input_text = "Question: How many hours in one day? Answer: "
- input_ids = tokenizer(input_text, return_tensors="pt").input_ids
-
- outputs = model.generate(input_ids)
- print(tokenizer.decode(outputs[0]))
- ```
-
- </details>
+ The model is quantized in 4-bit precision with `bitsandbytes`, so you can only use it with a compatible GPU.
 
 ### Running the model on a GPU
 
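The note above states that this checkpoint requires `bitsandbytes` and a compatible GPU. As a minimal sketch (not part of the commit above), a pre-flight check before running the snippets below could look like this:

```python
# Hypothetical environment check: the 4-bit checkpoint needs a CUDA-capable GPU
# and the bitsandbytes package installed.
import importlib.util

import torch

assert torch.cuda.is_available(), "A CUDA-capable GPU is required for this 4-bit checkpoint."
assert importlib.util.find_spec("bitsandbytes") is not None, "Install it with: pip install bitsandbytes"
```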
@@ -66,11 +54,14 @@ print(tokenizer.decode(outputs[0]))
 # pip install accelerate
 from transformers import AutoTokenizer, AutoModelForCausalLM
 
- tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b")
- model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-mamba-7b", device_map="auto")
+ tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b-instruct-4bit")
+ model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-mamba-7b-instruct-4bit", device_map="auto")
 
- input_text = "Question: How many hours in one day? Answer: "
- input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
+ # We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
+ messages = [
+     {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
+ ]
+ input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda")
 
 outputs = model.generate(input_ids)
 print(tokenizer.decode(outputs[0]))
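Equivalently (a sketch under the same assumptions as the added lines above, not taken from the card), the chat template can be rendered to a string first and tokenized in a second step, which makes the formatted prompt easy to inspect:

```python
from transformers import AutoTokenizer

# Same repo id as in the diff above.
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b-instruct-4bit")

messages = [
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]

# Render the chat template to a plain string, inspect it, then tokenize explicitly.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
```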
@@ -87,38 +78,16 @@ print(tokenizer.decode(outputs[0]))
 import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM
 
- tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b")
- model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-mamba-7b", torch_dtype=torch.bfloat16).to(0)
+ tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b-instruct-4bit")
+ model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-mamba-7b-instruct-4bit", torch_dtype=torch.bfloat16).to(0)
 
 model = torch.compile(model)
 
- input_text = "Question: How many hours in one day? Answer: "
- input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
-
- outputs = model.generate(input_ids)
- print(tokenizer.decode(outputs[0]))
- ```
-
- </details>
-
-
- ### Running the model on a GPU using different precisions
-
- #### FP16
-
- <details>
- <summary> Click to expand </summary>
-
- ```python
- # pip install accelerate
- import torch
- from transformers import AutoTokenizer, AutoModelForCausalLM
-
- tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b")
- model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-mamba-7b", device_map="auto", torch_dtype=torch.float16)
-
- input_text = "Question: How many hours in one day? Answer: "
- input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
+ # We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
+ messages = [
+     {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
+ ]
+ input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda")
 
 outputs = model.generate(input_ids)
 print(tokenizer.decode(outputs[0]))
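As a usage note on the `torch.compile` call kept in the hunk above: the first generation is slow because it triggers compilation, while later calls reuse the compiled graph. A minimal timing sketch (assuming the `model` and `input_ids` names from that snippet; `max_new_tokens=30` is an arbitrary illustrative choice):

```python
import time

# First iteration includes compilation overhead; the second reflects steady-state speed.
for label in ("warm-up (compiling)", "compiled"):
    start = time.perf_counter()
    _ = model.generate(input_ids, max_new_tokens=30)
    print(f"{label}: {time.perf_counter() - start:.1f} s")
```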
@@ -126,28 +95,6 @@ print(tokenizer.decode(outputs[0]))
 
 </details>
 
- #### 4-bit
-
- <details>
- <summary> Click to expand </summary>
-
- ```python
- # pip install bitsandbytes accelerate
- from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
-
- tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b")
- model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-mamba-7b", device_map="auto", quantization_config=BitsAndBytesConfig(load_in_4bit=True))
-
- input_text = "Question: How many hours in one day? Answer: "
- input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
-
- outputs = model.generate(input_ids)
- print(tokenizer.decode(outputs[0]))
- ```
-
- </details>
-
- <br>
 
 # Training Details
 
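The hunk above drops the old `#### 4-bit` section, since this repository already ships 4-bit weights. For comparison, a minimal sketch of the on-the-fly route that the removed lines used (quantizing the base `tiiuae/falcon-mamba-7b` checkpoint at load time with `BitsAndBytesConfig`):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load the full-precision base model and quantize it to 4-bit at load time,
# as the removed section did; this repo instead stores 4-bit weights directly.
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b")
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-mamba-7b",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)
```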
@@ -164,6 +111,8 @@ In particular, we used samples coming from [Fineweb-edu](https://huggingface.co/
 
 The data was tokenized with the Falcon-[7B](https://huggingface.co/tiiuae/falcon-7B)/[11B](https://huggingface.co/tiiuae/falcon-11B) tokenizer.
 
+ After pre-training, the model was further fine-tuned on instruction data.
+ 
 ## Training Procedure
 Falcon-Mamba-7B was trained on 256 H100 80GB GPUs for the majority of the training, using a 3D parallelism strategy (TP=1, PP=1, DP=256) combined with ZeRO.
 
 