metadata
language:
  - pl
license: apache-2.0
library_name: transformers
tags:
  - finetuned
  - gguf
inference: false
pipeline_tag: text-generation
base_model: speakleash/Bielik-11B-v2.3-Instruct

Bielik-11B-v2.3-Instruct-4bit-ov

This repo contains OpenVINO 4-bit format model files for SpeakLeash's Bielik-11B-v2.3-Instruct.

DISCLAIMER: Be aware that quantised models may show reduced response quality and are more prone to hallucinations!

Model usage with OpenVINO

This model can be deployed efficiently with OpenVINO. Below you can find two ways of running inference: with the optimum-intel library, and with the plain OpenVINO runtime.
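The pure-OpenVINO example further below reads the model files from disk; here is a minimal sketch of downloading the whole repository locally with huggingface_hub first (the use of huggingface_hub is an assumption and not part of the original instructions):

from huggingface_hub import snapshot_download

# Download the repository (OpenVINO IR, weights and tokenizer files) to a local directory
local_dir = snapshot_download("speakleash/Bielik-11B-v2.3-Instruct-4bit-ov")
print(local_dir)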

The simplest way to run the model is with OpenVINO and the optimum-intel library:

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "speakleash/Bielik-11B-v2.3-Instruct-4bit-ov"
model = OVModelForCausalLM.from_pretrained(model_id, use_cache=False)

question = "Dlaczego ryby nie potrafi膮 fruwa膰?"

# ChatML-style Bielik prompt; the system message says "Answer briefly, precisely and only in Polish."
prompt_text_bielik = f"""<s><|im_start|> system
Odpowiadaj krótko, precyzyjnie i wyłącznie w języku polskim.<|im_end|> 
<|im_start|> user
{question}<|im_end|> 
<|im_start|> assistant
"""

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer(prompt_text_bielik, return_tensors="pt")
# Generate up to 500 new tokens and decode the full sequence
outputs = model.generate(**inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
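Instead of writing the ChatML prompt by hand, the same prompt can be built from the tokenizer's chat template, assuming the tokenizer in this repository ships one (a sketch, not part of the original example):

# Re-uses the tokenizer loaded above
messages = [
    {"role": "system", "content": "Odpowiadaj krótko, precyzyjnie i wyłącznie w języku polskim."},
    {"role": "user", "content": "Dlaczego ryby nie potrafią fruwać?"},
]
# Render the conversation into a single prompt string, appending the assistant turn header
prompt_text_bielik = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)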

Run the model with the plain OpenVINO runtime only (additionally, the code below uses greedy decoding instead of sampling).

import openvino as ov
import numpy as np
from transformers import AutoTokenizer

model_path = "speakleash/Bielik-11B-v2.3-Instruct-4bit-ov/openvino_model.xml"
tokenizer = AutoTokenizer.from_pretrained("speakleash/Bielik-11B-v2.3-Instruct-4bit-ov")

ov_model = ov.Core().read_model(model_path)
compiled_model = ov.compile_model(ov_model, "CPU")
infer_request = compiled_model.create_infer_request()

question = "Dlaczego ryby nie potrafi膮 fruwa膰?"
prompt_text_bielik = f"""<s><|im_start|> system
Odpowiadaj kr贸tko, precyzyjnie i wy艂膮cznie w j臋zyku polskim.<|im_end|>
<|im_start|> user
{question}<|im_end|>
<|im_start|> assistant
"""

# Tokenize the prompt and build the inputs expected by the stateful model graph
tokens = tokenizer.encode(prompt_text_bielik, return_tensors="np")
input_ids = tokens
attention_mask = np.ones_like(input_ids)
position_ids = np.arange(len(tokens[0])).reshape(1, -1)
beam_idx = np.array([0], dtype=np.int32)

# Clear the KV-cache state before starting a new generation
infer_request.reset_state()

prev_output = ''
generated_text_ids = np.array([], dtype=np.int32)
num_max_token_for_generation = 500

print(f'Pytanie: {question}')  # "Question: ..."
print("Odpowiedź:", end=' ', flush=True)  # "Answer:"

for _ in range(num_max_token_for_generation):
    response = infer_request.infer(inputs={
        'input_ids': input_ids,
        'attention_mask': attention_mask,
        'position_ids': position_ids,
        'beam_idx': beam_idx
    })
    
    next_token_logits = response['logits'][0, -1, :]
    sampled_id = np.argmax(next_token_logits)  # Greedy decoding
    generated_text_ids = np.append(generated_text_ids, sampled_id)
    
    output_text = tokenizer.decode(generated_text_ids)
    print(output_text[len(prev_output):], end='', flush=True)
    prev_output = output_text
    
    # Next step feeds only the newly generated token; past context is kept in the request state
    input_ids = np.array([[sampled_id]], dtype=np.int64)
    attention_mask = np.array([[1]], dtype=np.int64)
    position_ids = np.array([[position_ids[0, -1] + 1]], dtype=np.int64)
    
    if sampled_id == tokenizer.eos_token_id:
        print('\n\n*** Zakończono generowanie.')  # "Generation finished."
        break

print(f'\n\n*** Wygenerowano {len(generated_text_ids)} tokenów.')  # "Generated N tokens."
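The greedy step above (np.argmax over the logits) can be swapped for sampling; here is a minimal sketch of temperature sampling over the same logits (the helper name and the temperature value are illustrative assumptions):

import numpy as np

def sample_next_token(logits, temperature=0.7):
    # Softmax over temperature-scaled logits, then draw one token id
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

# In the generation loop, replace np.argmax(next_token_logits) with:
# sampled_id = sample_next_token(next_token_logits)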

Model description:

Responsible for model quantization

  • Remigiusz Kinas (SpeakLeash) - team leadership, conceptualizing, calibration data preparation, process creation and quantized model delivery.

Contact Us

If you have any questions or suggestions, please use the discussion tab. If you want to contact us directly, join our SpeakLeash Discord.