RonanMcGovern committed
Commit 043bfe2
1 parent: 9e32c4e
add inference guide

README.md CHANGED
@@ -6,15 +6,64 @@ tags:
 - Composer
 - MosaicML
 - llm-foundry
+- hosted inference
+- 8 bit
+- 8bit
+- 8-bit
 inference: true
 ---
 
 
-#
+# MPT 7B Instruct - hosted inference
 
-This is simply an 8-bit version of the
+This is simply an 8-bit version of the mpt-7b-instruct model.
 - 8-bits allows the model to be below 10 GB
 - This allows for hosted inference of the model on the model's home page
+- Note that inference may be slow unless you have a HuggingFace Pro plan.
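A quick back-of-envelope check on that 10 GB figure (the ~6.7B parameter count is MosaicML's published size for MPT-7B; activations and framework overhead are ignored): at one byte per weight, 8-bit storage puts the weights near 7 GB, versus roughly 13 GB at 16-bit.

```
# Rough weight-memory estimate; 6.7e9 params assumed from MosaicML's MPT-7B spec.
params = 6.7e9
for dtype, bytes_per_param in [("fp16", 2), ("int8", 1)]:
    print(f"{dtype}: ~{params * bytes_per_param / 1e9:.1f} GB of weights")
```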
+
+If you want to run inference yourself (e.g. in a Colab notebook) you can try:
+```
+!pip install -q -U git+https://github.com/huggingface/accelerate.git
+!pip install -q -U bitsandbytes
+!pip install -q -U git+https://github.com/huggingface/transformers.git
+
+model_id = 'Trelis/mpt-7b-instruct-hosted-inference-8bit'
+
+import transformers
+from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
+
+config = transformers.AutoConfig.from_pretrained(model_id, trust_remote_code=True)
+config.init_device = 'cuda:0' # Unclear whether this really helps a lot or interacts with device_map.
+config.max_seq_len = 512
+
+# The tokenizer is used by stream() below; MPT also needs trust_remote_code=True
+# on the model itself, since it ships custom modelling code.
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(model_id, load_in_8bit=True, config=config, trust_remote_code=True)
+
+# MPT Inference
+def stream(user_instruction):
+    INSTRUCTION_KEY = "### Instruction:"
+    RESPONSE_KEY = "### Response:"
+    INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
+    PROMPT_FOR_GENERATION_FORMAT = """{intro}
+{instruction_key}
+{instruction}
+{response_key}
+""".format(
+        intro=INTRO_BLURB,
+        instruction_key=INSTRUCTION_KEY,
+        instruction="{instruction}",
+        response_key=RESPONSE_KEY,
+    )
+
+    prompt = PROMPT_FOR_GENERATION_FORMAT.format(instruction=user_instruction)
+
+    inputs = tokenizer([prompt], return_tensors="pt").to("cuda:0")
+    streamer = TextStreamer(tokenizer)
+
+    # Despite returning the usual output, the streamer will also print the generated text to stdout.
+    # temperature only takes effect when do_sample=True; as written, decoding is greedy.
+    _ = model.generate(**inputs, streamer=streamer, max_new_tokens=500, eos_token_id=0, temperature=1)
+
+stream('Count to ten')
+```
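On newer transformers releases the bare `load_in_8bit=True` kwarg is deprecated in favour of a `BitsAndBytesConfig` object. A minimal sketch of that form, with `device_map='auto'` added as an assumption rather than taken from the snippet above:

```
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Same repo id as above; quantization options wrapped in a config object.
model = AutoModelForCausalLM.from_pretrained(
    'Trelis/mpt-7b-instruct-hosted-inference-8bit',
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    trust_remote_code=True,  # MPT ships custom modelling code
    device_map='auto',       # assumption: let accelerate place the layers
)
```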
 
 ~
 
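The `stream` helper spells out the raw instruct prompt template; if you prefer the higher-level API, the same model and tokenizer can also be wrapped in a text-generation `pipeline`. A sketch, assuming `model` and `tokenizer` from the snippet above are already loaded; the prompt string simply inlines the template that `stream` builds:

```
from transformers import pipeline

# Reuses the already-loaded 8-bit model and tokenizer from the README snippet.
generate = pipeline('text-generation', model=model, tokenizer=tokenizer)

prompt = ("Below is an instruction that describes a task. "
          "Write a response that appropriately completes the request.\n"
          "### Instruction:\nCount to ten\n### Response:\n")
result = generate(prompt, max_new_tokens=100, eos_token_id=0)
print(result[0]['generated_text'])
```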