jartine committed on
Commit 3af6a03
1 Parent(s): b4d83da

Update README.md

Files changed (1)
  1. README.md +15 -14
README.md CHANGED
@@ -114,20 +114,21 @@ speedups for llama.cpp's simplest quants: Q8\_0 and Q4\_0.
  This model is very large. Even at Q2 quantization, it's still well-over
  twice as large the highest tier NVIDIA gaming GPUs. llamafile supports
  splitting models over multiple GPUs (for NVIDIA only currently) if you
- have such a system. The best way to get one, if you don't, is to pay a
- few bucks an hour to rent a 4x RTX 4090 rig off vast.ai.
-
- Mac Studio is a good option for running this model. An M2 Ultra desktop
- from Apple is affordable and has 128GB of unified RAM+VRAM. If you have
- one, then llamafile will use your Metal GPU. Try starting out with the
- `Q4_0` quantization level.
-
- Another good option for running very large, large language models is to
- just use CPU. We developed new tensor multiplication kernels on the
- llamafile project specifically to speed up "mixture of experts" LLMs
- like Mixtral. On a AMD Threadripper Pro 7995WX with 256GB of 5200 MT/s
- RAM, llamafile v0.8 runs Mixtral 8x22B Q4\_0 at 98 tokens per second for
- evaluation, and it predicts 9.44 tokens per second.
+ have such a system. The easiest way to have one, if you don't, is to pay
+ a few bucks an hour to rent a 4x RTX 4090 rig off vast.ai.
+
+ Mac Studio is a good option for running this model locally. An M2 Ultra
+ desktop from Apple is affordable and has 128GB of unified RAM+VRAM. If
+ you have one, then llamafile will use your Metal GPU. Try starting out
+ with the `Q4_0` quantization level.
+
+ Another good option for running large, large language models locally and
+ fully under your control is to just use CPU inference. We developed new
+ tensor multiplication kernels on the llamafile project specifically to
+ speed up "mixture of experts" LLMs like Mixtral. On a AMD Threadripper
+ Pro 7995WX with 256GB of 5200 MT/s RAM, llamafile v0.8 runs Mixtral
+ 8x22B Q4\_0 on Linux at 98 tokens per second for evaluation, and it
+ predicts 9.44 tokens per second.

  ---

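For readers of this change, the commands below are a minimal sketch (not part of the commit) of the setups the updated README describes. The filename is a placeholder for whichever quant you actually download, and only standard llamafile flags (`-ngl`, `--gpu`, `-p`) are shown; multi-GPU layer splitting on NVIDIA is assumed to follow llama.cpp's default behavior when several GPUs are visible.

```sh
# Placeholder filename; substitute the llamafile you downloaded.
chmod +x mixtral-8x22b-instruct-v0.1.Q4_0.llamafile

# Apple Silicon (e.g. M2 Ultra): offload all layers to the Metal GPU.
./mixtral-8x22b-instruct-v0.1.Q4_0.llamafile --gpu apple -ngl 9999 -p 'hello world'

# Multi-GPU NVIDIA rig (e.g. 4x RTX 4090): -ngl offloads layers, which
# llama.cpp-based runtimes split across visible GPUs by default.
./mixtral-8x22b-instruct-v0.1.Q4_0.llamafile --gpu nvidia -ngl 9999 -p 'hello world'

# CPU-only inference, e.g. on a Threadripper Pro with plenty of RAM.
./mixtral-8x22b-instruct-v0.1.Q4_0.llamafile -ngl 0 -p 'hello world'
```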