---
license: mit
---
|
- Source Mistral 7B model:</br> |
|
https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/ |
|
|
|
- This model was converted from the bfloat16 data type to the int8 data type with the conversion tool from:</br>
|
https://github.com/ggerganov/llama.cpp |
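
For reference, a minimal sketch of the conversion steps with the llama.cpp tooling is shown below. The file names and paths are assumptions, and the exact script/binary names vary between llama.cpp versions (e.g. convert.py vs. convert_hf_to_gguf.py, quantize vs. llama-quantize), so check the repository above for the current workflow.

```
# Assumptions: llama.cpp is checked out and built locally, and the original
# Hugging Face model is downloaded to /path/to/Mistral-7B-Instruct-v0.2.

# 1. Convert the Hugging Face weights to a GGUF file, keeping f16 precision
python convert.py /path/to/Mistral-7B-Instruct-v0.2 \
    --outtype f16 \
    --outfile mistral-7B-instruct-v0.2-f16.gguf

# 2. Quantize the f16 GGUF down to 8-bit (Q8_0)
./quantize mistral-7B-instruct-v0.2-f16.gguf mistral-7B-instruct-v0.2-q8.gguf Q8_0
```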
|
|
|
- Deployment on CPU:</br> |
|
Pull the ready-made llama.cpp container: |
|
``` |
|
docker pull ghcr.io/ggerganov/llama.cpp:server |
|
``` |
|
Assuming the mistral-7B-instruct-v0.2-q8.gguf file is downloaded to the /path/to/models directory on the local machine, run the container serving the model with:
|
``` |
|
docker run -v /path/to/models:/models -p 8000:8000 ghcr.io/ggerganov/llama.cpp:server -m /models/mistral-7B-instruct-v0.2-q8.gguf --port 8000 --host 0.0.0.0 -n 512
|
``` |
|
- Test the deployment by accessing the model with a browser at http://localhost:8000, or from the command line as shown below
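
A minimal command-line check using the server's /completion endpoint (the prompt and n_predict value are only examples):

```
curl --request POST \
     --url http://localhost:8000/completion \
     --header "Content-Type: application/json" \
     --data '{"prompt": "Building a website can be done in 10 simple steps:", "n_predict": 128}'
```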
|
- The llama.cpp server also provides an OpenAI-compatible API; an example request is sketched below
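
A sketch of an OpenAI-style request against the server's /v1/chat/completions endpoint (endpoint availability depends on the llama.cpp version, and the model field is largely informational for this server):

```
curl http://localhost:8000/v1/chat/completions \
     --header "Content-Type: application/json" \
     --data '{
       "model": "mistral-7B-instruct-v0.2-q8",
       "messages": [
         {"role": "user", "content": "Write a haiku about containers."}
       ]
     }'
```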
|
- Deployment on CUDA GPU:</br>

Pull the ready-made CUDA-enabled llama.cpp container:
``` |
|
docker pull ghcr.io/ggerganov/llama.cpp:server-cuda |
|
``` |

Run the container, offloading model layers to the GPU with the --n-gpu-layers option:

``` |
|
docker run --gpus all -v /path/to/models:/models -p 8000:8000 ghcr.io/ggerganov/llama.cpp:server-cuda -m /models/mistral-7B-instruct-v0.2-q8.gguf --port 8000 --host 0.0.0.0 -n 512 --n-gpu-layers 50 |
|
``` |
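
If the container does not see the GPU, the standard NVIDIA Container Toolkit check can be used to verify that Docker can access it (the CUDA image tag below is only an example):

```
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```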
|
- If a CUDA GPU with 16GB of VRAM is available, the float16 version of the model may be of interest; it is available in this repo:</br>
|
https://huggingface.co/itod/mistral-7B-instruct-v0.2-f16 |
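
The f16 version can be served the same way; a sketch assuming the file is named mistral-7B-instruct-v0.2-f16.gguf and downloaded to the same /path/to/models directory (check the repository above for the actual file name):

```
docker run --gpus all -v /path/to/models:/models -p 8000:8000 ghcr.io/ggerganov/llama.cpp:server-cuda -m /models/mistral-7B-instruct-v0.2-f16.gguf --port 8000 --host 0.0.0.0 -n 512 --n-gpu-layers 50
```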
|
- More details about usage are available in the llama.cpp server documentation:</br>
|
https://github.com/ggerganov/llama.cpp/tree/master/examples/server |