Bad output:
import torch
from transformers import pipeline, AutoTokenizer
model_id = "ChenMnZ/Llama-3-8b-instruct-EfficientQAT-w2g64-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
pipe = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,  # Pass the tokenizer to the pipeline
    torch_dtype=torch.float16,
    device_map="auto",
)
# Change 1: Disable sampling or adjust temperature
output = pipe("ai is ", max_new_tokens=50, do_sample=False)  # Disable sampling
# OR
output = pipe("The key to life is", max_new_tokens=50, do_sample=True, temperature=0.1)  # Adjust temperature for less randomness

# Change 2: Add a logits processor to handle invalid probabilities
# This method requires more advanced knowledge and might not solve the issue in this specific case.
from transformers import LogitsProcessorList, MinLengthLogitsProcessor
logits_processor = LogitsProcessorList([
    MinLengthLogitsProcessor(1, pipe.tokenizer.eos_token_id),  # Ensure a minimum generation length of 1
])
output = pipe("The key to life is", max_new_tokens=50, do_sample=True, temperature=0.7, logits_processor=logits_processor)
print(output[0]['generated_text'])  # Access the generated text from the output
/usr/local/lib/python3.11/dist-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning: The secret `HF_TOKEN` does not exist in your Colab secrets. To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session. You will be able to reuse this secret in all of your notebooks. Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
ENV: Auto setting PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' for memory saving.
ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for compatibililty.
INFO - Auto pick kernel based on compatibility: <class 'gptqmodel.nn_modules.qlinear.torch.TorchQuantLinear'>
/usr/local/lib/python3.11/dist-packages/transformers/modeling_utils.py:4371: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
  warnings.warn(
The `cos_cached` attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use the forward method of RoPE from now on instead. It is not used in the `LlamaAttention` class
The `sin_cached` attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use the forward method of RoPE from now on instead. It is not used in the `LlamaAttention` class
ai is !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
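The run above ends in a wall of "!" tokens. With an instruct-tuned Llama 3 checkpoint this is often just a prompting problem: passing a raw string bypasses the chat template the model was trained with. Below is a minimal sketch of what I would try first, assuming the repo's tokenizer config ships the usual Llama 3 chat template; the prompt text is only an example, not from the thread:

import torch
from transformers import pipeline, AutoTokenizer

model_id = "ChenMnZ/Llama-3-8b-instruct-EfficientQAT-w2g64-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
pipe = pipeline("text-generation", model=model_id, tokenizer=tokenizer,
                torch_dtype=torch.float16, device_map="auto")

# Build the prompt through the chat template instead of passing raw text.
messages = [{"role": "user", "content": "What is AI?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

output = pipe(prompt, max_new_tokens=50, do_sample=False, return_full_text=False)
print(output[0]["generated_text"])

If the output is still degenerate with the proper template, the problem is more likely the 2-bit weights or the kernel than the decoding settings.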
On a Colab T4:
from transformers import AutoTokenizer
from gptqmodel import GPTQModel

# Use a quantized model path from the Hugging Face model hub
quant_dir = "ChenMnZ/Llama-3-8b-instruct-EfficientQAT-w2g64-GPTQ"  # Example quantized model
tokenizer = AutoTokenizer.from_pretrained(quant_dir, use_fast=True)

# Load the quantized model
model = GPTQModel.from_quantized(quant_dir)

# Perform inference
prompt = "Model quantization is"
input_ids = tokenizer(prompt, return_tensors="pt").to(model.device)
generated_ids = model.generate(**input_ids)
print(tokenizer.decode(generated_ids[0]))
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Fetching 8 files: 100% 8/8 [00:00<00:00, 4.57it/s]
.gitattributes: 100% 1.52k/1.52k [00:00<00:00, 58.6kB/s]
README.md: 100% 8.72k/8.72k [00:00<00:00, 167kB/s]
quantize_config.json: 100% 466/466 [00:00<00:00, 14.0kB/s]
INFO - Ignoring unknown parameter in the quantization configuration: model_name_or_path.
INFO - Ignoring unknown parameter in the quantization configuration: model_file_base_name.
INFO - Estimated Quantization BPW (bits per weight): 2.34375 bpw, based on [bits: 2, group_size: 64]
INFO - Auto enabling flash attention2
INFO - Auto pick kernel based on compatibility: <class 'gptqmodel.nn_modules.qlinear.dynamic_cuda.DynamicCudaQuantLinear'>
INFO - make_quant: Linear candidates: [<class 'gptqmodel.nn_modules.qlinear.dynamic_cuda.DynamicCudaQuantLinear'>, <class 'gptqmodel.nn_modules.qlinear.torch.TorchQuantLinear'>]
INFO - make_quant: Selected linear: <class 'gptqmodel.nn_modules.qlinear.dynamic_cuda.DynamicCudaQuantLinear'>
.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
/usr/local/lib/python3.11/dist-packages/transformers/generation/utils.py:1141: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
RuntimeError Traceback (most recent call last)
in <cell line: 0>()
13 prompt = "Model quantization is"
14 input_ids = tokenizer(prompt, return_tensors="pt").to(model.device)
---> 15 generated_ids = model.generate(**input_ids)
16 print(tokenizer.decode(generated_ids[0]))
18 frames
/usr/local/lib/python3.11/dist-packages/gptqmodel/nn_modules/qlinear/dynamic_cuda.py in forward(self, x)
129 )
130
--> 131 out = out.to(x.dtype).reshape(out_shape)
132 if self.bias is not None:
133 out.add_(self.bias)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
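Before changing the code, it may help to confirm which GPU architecture the failing kernel would have to support. "no kernel image is available for execution on the device" generally means the compiled CUDA extension does not include binaries for the current compute capability (a Colab T4 reports 7.5, i.e. sm_75). This is a plain PyTorch check, not gptqmodel-specific:

import torch

# Print the device name and compute capability; a Colab T4 should report (7, 5), i.e. sm_75.
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))
print(torch.version.cuda)  # CUDA toolkit version this PyTorch build was compiled against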
from transformers import AutoTokenizer
from gptqmodel import GPTQModel

quant_dir = "ChenMnZ/Llama-3-8b-instruct-EfficientQAT-w2g64-GPTQ"
# quant_dir = "ChenMnZ/Llama-2-7b-EfficientQAT-w2g128-BitBLAS"  # alternative checkpoint
# or local path
tokenizer = AutoTokenizer.from_pretrained(quant_dir, use_fast=True)

# load quantized model to the first GPU
model = GPTQModel.from_quantized(quant_dir)

# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("Model quantization is", return_tensors="pt").to(model.device))[0]))
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Fetching 8 files: 100% 8/8 [00:00<00:00, 395.16it/s]
INFO - Ignoring unknown parameter in the quantization configuration: model_name_or_path.
INFO - Ignoring unknown parameter in the quantization configuration: model_file_base_name.
INFO - Estimated Quantization BPW (bits per weight): 2.34375 bpw, based on [bits: 2, group_size: 64]
INFO - Auto enabling flash attention2
INFO - make_quant: Linear candidates: [<class 'gptqmodel.nn_modules.qlinear.dynamic_cuda.DynamicCudaQuantLinear'>, <class 'gptqmodel.nn_modules.qlinear.torch.TorchQuantLinear'>]
INFO - make_quant: Selected linear: <class 'gptqmodel.nn_modules.qlinear.dynamic_cuda.DynamicCudaQuantLinear'>
.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
RuntimeError Traceback (most recent call last)
in <cell line: 0>()
13
14 # inference with model.generate
---> 15 print(tokenizer.decode(model.generate(**tokenizer("Model quantization is", return_tensors="pt").to(model.device))[0]))
18 frames
/usr/local/lib/python3.11/dist-packages/gptqmodel/nn_modules/qlinear/dynamic_cuda.py in forward(self, x)
129 )
130
--> 131 out = out.to(x.dtype).reshape(out_shape)
132 if self.bias is not None:
133 out.add_(self.bias)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
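Both gptqmodel runs crash inside DynamicCudaQuantLinear, while the transformers pipeline runs earlier in the thread survived because they fell back to TorchQuantLinear. If your installed gptqmodel version exposes a BACKEND enum (the `backend` argument and `BACKEND.TORCH` below are assumptions to verify against your release), forcing the pure-Torch kernel is a reasonable workaround on the T4: slower, but architecture-independent. A sketch:

from transformers import AutoTokenizer
from gptqmodel import GPTQModel, BACKEND  # BACKEND availability depends on the installed gptqmodel version

quant_dir = "ChenMnZ/Llama-3-8b-instruct-EfficientQAT-w2g64-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(quant_dir, use_fast=True)

# Ask for the Torch kernel instead of the dynamic CUDA one that raises "no kernel image".
model = GPTQModel.from_quantized(quant_dir, backend=BACKEND.TORCH)

inputs = tokenizer("Model quantization is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))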
from transformers import AutoTokenizer, pipeline
import torch

# Load the tokenizer with add_prefix_space=False
tokenizer = AutoTokenizer.from_pretrained(
    "ChenMnZ/Llama-2-7b-EfficientQAT-w2g128-GPTQ",
    trust_remote_code=True,
    add_prefix_space=False  # Add this line
)

# Create the pipeline with tuned settings
pipe = pipeline(
    "text-generation",
    model="ChenMnZ/Llama-2-7b-EfficientQAT-w2g128-GPTQ",
    tokenizer=tokenizer,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
    # Extra settings to improve output quality
    max_length=256,       # Total length of the generated text
    num_beams=4,          # Number of beams for beam search
    early_stopping=True,  # Stop generation early once the end of the sequence is reached
)

# Generate text using max_length
output = pipe("Who is AI?", do_sample=False)
print(output)
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
ENV: Auto setting PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' for memory saving.
ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for compatibililty.
INFO - Auto pick kernel based on compatibility: <class 'gptqmodel.nn_modules.qlinear.torch.TorchQuantLinear'>
/usr/local/lib/python3.11/dist-packages/transformers/modeling_utils.py:4371: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
  warnings.warn(
The `cos_cached` attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use the forward method of RoPE from now on instead. It is not used in the `LlamaAttention` class
The `sin_cached` attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use the forward method of RoPE from now on instead. It is not used in the `LlamaAttention` class
[{'generated_text': 'Who is AI?\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x02'}]
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "ChenMnZ/Llama-3-8b-EfficientQAT-w2g128-GPTQ"

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Format the input
prompt = "Give me a short introduction to large language model."
formatted_prompt = f"{tokenizer.bos_token}{prompt}{tokenizer.eos_token}"

# Tokenize and move to device
model_inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)

# Generate response
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id
)

# Decode output
response = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(response)
ENV: Auto setting PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' for memory saving.
ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for compatibililty.
INFO - Auto pick kernel based on compatibility: <class 'gptqmodel.nn_modules.qlinear.torch.TorchQuantLinear'>
/usr/local/lib/python3.11/dist-packages/transformers/modeling_utils.py:4371: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
  warnings.warn(
The `cos_cached` attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use the forward method of RoPE from now on instead. It is not used in the `LlamaAttention` class
The `sin_cached` attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use the forward method of RoPE from now on instead. It is not used in the `LlamaAttention` class
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
Give me a short introduction to large language model.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
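Two details in this last run are worth noting. The prompt is wrapped with tokenizer.eos_token, so the model starts generating from a sequence it considers finished, and because pad_token_id was set to the same id, generate treats that trailing token as right padding, which is likely what triggers the padding warning. Also, the checkpoint name has no "instruct" suffix, so it appears to be a base model that only continues text. A minimal sketch of the same snippet without the trailing eos_token (the padding_side='left' argument only addresses the warning for padded batches):

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "ChenMnZ/Llama-3-8b-EfficientQAT-w2g128-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")

# Plain continuation prompt; the Llama 3 tokenizer typically prepends bos_token on its own,
# and no eos_token is appended, so generation does not start from an "ended" sequence.
prompt = "Give me a short introduction to large language models."
model_inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=100,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))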