metadata
language:
- pl
license: apache-2.0
library_name: transformers
tags:
- finetuned
- gguf
inference: false
pipeline_tag: text-generation
base_model: speakleash/Bielik-11B-v2.3-Instruct
Bielik-11B-v2.3-Instruct-GPTQ
This repo contains OpenVino 4bit format model files for SpeakLeash's Bielik-11B-v.2.3-Instruct.
DISCLAIMER: Be aware that quantised models show reduced response quality and possible hallucinations!
Model usage with OpenVino
This model can be deployed efficiently using the OpenVino. Below you can find two ways of model inference: using Intel Optimum, pure OpenVino library.
The most simple LLM inferencing code with OpenVINO and the optimum-intel library.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer
model_id = "speakleash/Bielik-11B-v2.3-Instruct-4bit-ov"
model = OVModelForCausalLM.from_pretrained(model_id, use_cache=False)
question = "Dlaczego ryby nie potrafi膮 fruwa膰?"
prompt_text_bielik = f"""<s><|im_start|> system
Odpowiadaj kr贸tko, precyzyjnie i wy艂膮cznie w j臋zyku polskim.<|im_end|>
<|im_start|> user
{question}<|im_end|>
<|im_start|> assistant
"""
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer(prompt_text_bielik, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Run an LLM model with only OpenVINO (additionaly we provided code which uses 'greedy decoding' instead of sampling).
import openvino as ov
import numpy as np
from transformers import AutoTokenizer
model_path = "speakleash/Bielik-11B-v2.3-Instruct-4bit-ov/openvino_model.xml"
tokenizer = AutoTokenizer.from_pretrained("speakleash/Bielik-11B-v2.3-Instruct-4bit-ov")
ov_model = ov.Core().read_model(model_path)
compiled_model = ov.compile_model(ov_model, "CPU")
infer_request = compiled_model.create_infer_request()
question = "Dlaczego ryby nie potrafi膮 fruwa膰?"
prompt_text_bielik = f"""<s><|im_start|> system
Odpowiadaj kr贸tko, precyzyjnie i wy艂膮cznie w j臋zyku polskim.<|im_end|>
<|im_start|> user
{question}<|im_end|>
<|im_start|> assistant
"""
tokens = tokenizer.encode(prompt_text_bielik, return_tensors="np")
input_ids = tokens
attention_mask = np.ones_like(input_ids)
position_ids = np.arange(len(tokens[0])).reshape(1, -1)
beam_idx = np.array([0], dtype=np.int32)
infer_request.reset_state()
prev_output = ''
generated_text_ids = np.array([], dtype=np.int32)
num_max_token_for_generation = 500
print(f'Pytanie: {question}')
print("Odpowied藕:", end=' ', flush=True)
for _ in range(num_max_token_for_generation):
response = infer_request.infer(inputs={
'input_ids': input_ids,
'attention_mask': attention_mask,
'position_ids': position_ids,
'beam_idx': beam_idx
})
next_token_logits = response['logits'][0, -1, :]
sampled_id = np.argmax(next_token_logits) # Greedy decoding
generated_text_ids = np.append(generated_text_ids, sampled_id)
output_text = tokenizer.decode(generated_text_ids)
print(output_text[len(prev_output):], end='', flush=True)
prev_output = output_text
input_ids = np.array([[sampled_id]], dtype=np.int64)
attention_mask = np.array([[1]], dtype=np.int64)
position_ids = np.array([[position_ids[0, -1] + 1]], dtype=np.int64)
if sampled_id == tokenizer.eos_token_id:
print('\n\n*** Zako艅czono generowanie.')
break
print(f'\n\n*** Wygenerowano {len(generated_text_ids)} token贸w.')
Model description:
- Developed by: SpeakLeash & ACK Cyfronet AGH
- Language: Polish
- Model type: causal decoder-only
- Quant from: Bielik-11B-v2.3-Instruct
- Finetuned from: Bielik-11B-v2
- License: Apache 2.0 and Terms of Use
Responsible for model quantization
- Remigiusz KinasSpeakLeash - team leadership, conceptualizing, calibration data preparation, process creation and quantized model delivery.
Contact Us
If you have any questions or suggestions, please use the discussion tab. If you want to contact us directly, join our Discord SpeakLeash.