
Can't get it up and running

#2
by StephanNoller - opened

Hi all, I wanted to get Teuken up and running under vLLM, but everything has failed so far. I am running a pod on RunPod that serves Occiglot without problems and tried the same settings for Teuken, but it is somehow not working (the logs look fine, but there is still no response). These are my settings: --host 0.0.0.0 --port 8000 --model openGPT-X/Teuken-7B-instruct-commercial-v0.4 --dtype bfloat16 --enforce-eager --gpu-memory-utilization 0.95 --trust-remote-code.
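(For reference, those flags correspond to a launch command along these lines, a sketch assuming vLLM's OpenAI-compatible server entrypoint; adjust to however your pod invokes vLLM:)

python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model openGPT-X/Teuken-7B-instruct-commercial-v0.4 --dtype bfloat16 --enforce-eager --gpu-memory-utilization 0.95 --trust-remote-code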
I also tried to set it up here on HF as an Inference Endpoint, but that does not work either. On HF I suspect that my --trust-remote-code setting is not being passed through, but I'm not sure. Any ideas? Or recommendations on where I can deploy it in a better way? Could it be that I am running into problems because of the chat template? (I'm using the OpenAI API.) Thanks in advance,
Stephan

OK, it runs now. It turns out that everything was actually up and running, but I do seem to have issues with the chat template: the roles "user" and "assistant" are apparently not accepted. At the moment it only works if I pass a single message with the role "user". How can I change this?

btw, I am calling it like this (I took the example from the README/model card):

from openai import OpenAI

# Point the OpenAI client at the local vLLM server.
client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)
completion = client.chat.completions.create(
    model="openGPT-X/Teuken-7B-instruct-research-v0.4",
    messages=[{"role": "User", "content": "Hallo"}],
    # Select the German chat template shipped with the model.
    extra_body={"chat_template": "DE"},
)
print(f"Assistant: {completion}")

StephanNoller changed discussion status to closed
StephanNoller changed discussion status to open

Hi @StephanNoller , thanks for investigating. For this instruction-tuned model, we used the selection of system messages listed here during training: https://huggingface.co/openGPT-X/Teuken-7B-instruct-commercial-v0.4/blob/main/gptx_tokenizer.py#L432

We suggest using these system messages for inference as well. If you want to test custom system messages, you can specify a custom chat template with vLLM, as described at https://huggingface.co/openGPT-X/Teuken-7B-instruct-commercial-v0.4#usage-with-vllm-server, and set the --chat-template parameter to the following Jinja template:

{%- if messages[0]["role"] == "system" %}
{{- messages[0]['role']|capitalize + ': ' + messages[0]['content'] + '\n' }}
{%- set loop_messages = messages[1:] %}
{%- else %}
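{#- Default system prompt; in English: "A conversation between a human and an artificial-intelligence assistant. The assistant gives helpful and polite answers to the human's questions." -#}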
System: Ein Gespräch zwischen einem Menschen und einem Assistenten mit künstlicher Intelligenz. Der Assistent gibt hilfreiche und höfliche Antworten auf die Fragen des Menschen.{{- '\n'}}
{%- set loop_messages = messages %}
{%- endif %}
{%- for message in loop_messages %}
{%- if message['role']|lower == 'user' %}
{{- message['role']|capitalize + ': ' + message['content'] + '\n' }}
{%- elif message['role']|lower == 'assistant' %}
{{- message['role']|capitalize + ': ' + message['content'] + '</s>' + '\n' }}
{%- else %}
{{- raise_exception('Only user and assistant roles are supported!') }}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{{- 'Assistant: '}}
{%- endif %}
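Saved to a file (e.g. teuken_template.jinja; the file name here is just an example) and passed to the server via --chat-template, the template above accepts lowercase roles directly. A minimal sketch of a multi-turn request against such a server follows; the assistant turn is made up for illustration:

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)
# The server-side template now handles the formatting, so no extra_body is
# needed; it lowercases roles for comparison and capitalizes them on output.
completion = client.chat.completions.create(
    model="openGPT-X/Teuken-7B-instruct-commercial-v0.4",
    messages=[
        {"role": "user", "content": "Hallo!"},
        {"role": "assistant", "content": "Hallo! Wie kann ich dir helfen?"},
        {"role": "user", "content": "Erzähl mir bitte einen Witz."},
    ],
)
print(f"Assistant: {completion.choices[0].message.content}")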
