---
license: mit
tags:
- ctranslate2
---

# Fast Inference with CTranslate2

Speed up inference by 2x-8x using int8 inference in C++.

Quantized version of [databricks/dolly-v2-12b](https://huggingface.co/databricks/dolly-v2-12b).

```bash
# Quote the specifiers so the shell does not treat ">=" as a redirect.
pip install "hf_hub_ctranslate2>=1.0.0" "ctranslate2>=3.13.0"
```

This checkpoint is compatible with [ctranslate2](https://github.com/OpenNMT/CTranslate2) and [hf-hub-ctranslate2](https://github.com/michaelfeil/hf-hub-ctranslate2):

- `compute_type=int8_float16` for `device="cuda"`
- `compute_type=int8` for `device="cpu"`

```python
from hf_hub_ctranslate2 import GeneratorCT2fromHfHub

model_name = "michaelfeil/ct2fast-dolly-v2-12b"

# Load in int8 on CUDA.
model = GeneratorCT2fromHfHub(
    model_name_or_path=model_name,
    device="cuda",
    compute_type="int8_float16",
)

# Generate one completion per input prompt.
outputs = model.generate(
    text=["How do you call a fast Flan-ingo?", "User: How are you doing?"],
)
print(outputs)
```

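The device and `compute_type` pairing listed earlier can be kept in one place with a small helper. This is only a sketch: `pick_compute_type` is a hypothetical name, not part of ctranslate2 or hf-hub-ctranslate2, and the mapping simply restates the two recommendations from this model card.

```python
def pick_compute_type(device: str) -> str:
    """Return the compute_type this model card recommends for a device."""
    # Mapping copied from the recommendations in this model card;
    # the helper itself is hypothetical and not a library function.
    recommended = {"cuda": "int8_float16", "cpu": "int8"}
    try:
        return recommended[device]
    except KeyError:
        raise ValueError(f"no recommended compute_type for device {device!r}") from None


device = "cpu"  # or "cuda" when a GPU is available
print(device, pick_compute_type(device))
```

The returned string can then be passed as `compute_type` together with the same `device` value when constructing `GeneratorCT2fromHfHub`.
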
# Licence and other remarks:

This is just a quantized version. Licence conditions are intended to be identical to those of the original Hugging Face repo.

# Usage of Dolly-v2:

According to the instruction pipeline of databricks/dolly-v2-12b:

```python
# from https://huggingface.co/databricks/dolly-v2-12b
def encode_prompt(instruction):
    INSTRUCTION_KEY = "### Instruction:"
    RESPONSE_KEY = "### Response:"
    END_KEY = "### End"  # not used when building the prompt
    INTRO_BLURB = (
        "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    )

    # This is the prompt that is used for generating responses using an already trained model. It ends with the response
    # key, where the job of the model is to provide the completion that follows it (i.e. the response itself).
    PROMPT_FOR_GENERATION_FORMAT = """{intro}
{instruction_key}
{instruction}
{response_key}
""".format(
        intro=INTRO_BLURB,
        instruction_key=INSTRUCTION_KEY,
        instruction="{instruction}",
        response_key=RESPONSE_KEY,
    )
    return PROMPT_FOR_GENERATION_FORMAT.format(instruction=instruction)
```
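The snippet above defines `END_KEY` but never uses it. One plausible use, sketched here under assumptions, is post-processing: cutting the generated text at the first `### End` marker. `extract_response` is a hypothetical helper, not part of hf-hub-ctranslate2 or the Dolly pipeline, and whether the output actually echoes the prompt or contains the marker depends on the generation settings.

```python
END_KEY = "### End"  # end marker, same value as in encode_prompt


def extract_response(generated: str, prompt: str = "") -> str:
    """Strip an echoed prompt and cut the text at the first END_KEY."""
    # Drop the prompt if the generator returned it in front of the answer.
    if prompt and generated.startswith(prompt):
        generated = generated[len(prompt):]
    # Keep only what precedes the end marker; partition leaves the text
    # unchanged when the marker is absent.
    response, _, _ = generated.partition(END_KEY)
    return response.strip()


# Hand-written sample text, not real model output.
raw = "### Response:\nFlamingos are pink.\n### End\n### Instruction:"
print(extract_response(raw, prompt="### Response:\n"))
```

In practice the `prompt` argument would be the string returned by `encode_prompt`, and `generated` one element of the list returned by `model.generate`.
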