bad

#1
by rakmik - opened

import torch
from transformers import pipeline, AutoTokenizer

model_id = "ChenMnZ/Llama-3-8b-instruct-EfficientQAT-w2g64-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

pipe = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,  # pass the tokenizer to the pipeline explicitly
    torch_dtype=torch.float16,
    device_map="auto",
)

Change 1: Disable sampling or adjust temperature

output = pipe("ai is ", max_new_tokens=50, do_sample=False) # Disable sampling

OR

output = pipe("The key to life is", max_new_tokens=50, do_sample=True, temperature=0.1) # Adjust temperature for less randomness

Change 2: Add a logits processor to handle invalid probabilities

This method requires more advanced knowledge and might not solve the issue in this specific case.

from transformers import LogitsProcessorList, MinLengthLogitsProcessor

logits_processor = LogitsProcessorList([
    MinLengthLogitsProcessor(1, pipe.tokenizer.eos_token_id),  # ensure a minimum length of 1
])

output = pipe("The key to life is", max_new_tokens=50, do_sample=True, temperature=0.7, logits_processor=logits_processor)

print(output[0]['generated_text']) # Access the generated text from the output
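Side note: MinLengthLogitsProcessor only blocks the EOS token for the first steps; it does not touch invalid values. If the underlying problem really is inf/nan in the logits, transformers also ships InfNanRemoveLogitsProcessor for exactly that case. A minimal sketch, reusing the pipe defined above:

from transformers import LogitsProcessorList, InfNanRemoveLogitsProcessor

# Replace inf/nan logits with finite values before sampling.
nan_safe_processor = LogitsProcessorList([InfNanRemoveLogitsProcessor()])
output = pipe("The key to life is", max_new_tokens=50, do_sample=True, temperature=0.7,
              logits_processor=nan_safe_processor)
print(output[0]['generated_text'])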

/usr/local/lib/python3.11/dist-packages/huggingface_hub/file_download.py:795: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(
/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning:
The secret HF_TOKEN does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
ENV: Auto setting PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' for memory saving.
ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for compatibililty.
INFO - Auto pick kernel based on compatibility: <class 'gptqmodel.nn_modules.qlinear.torch.TorchQuantLinear'>
/usr/local/lib/python3.11/dist-packages/transformers/modeling_utils.py:4371: FutureWarning: _is_quantized_training_enabled is going to be deprecated in transformers 4.39.0. Please use model.hf_quantizer.is_trainable instead
warnings.warn(
The cos_cached attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use the forward method of RoPE from now on instead. It is not used in the LlamaAttention class
The sin_cached attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use the forward method of RoPE from now on instead. It is not used in the LlamaAttention class
ai is !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
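The repeated "!" tokens suggest the 2-bit checkpoint (running on the pure-Torch kernel picked above) is producing a degenerate next-token distribution, rather than a sampling-settings problem. A quick diagnostic, assuming the pipe above loaded successfully, is to look at the raw logits directly:

import torch

# Check the raw logits for NaN/Inf and look at the top next-token candidates.
inputs = pipe.tokenizer("ai is", return_tensors="pt").to(pipe.model.device)
with torch.no_grad():
    logits = pipe.model(**inputs).logits
print("NaN in logits:", torch.isnan(logits).any().item())
print("Inf in logits:", torch.isinf(logits).any().item())
top_ids = logits[0, -1].topk(5).indices.tolist()
print("top-5 next tokens:", pipe.tokenizer.convert_ids_to_tokens(top_ids))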

On a Colab T4:

from transformers import AutoTokenizer
from gptqmodel import GPTQModel

# Use a quantized model path from the Hugging Face Hub

quant_dir = "ChenMnZ/Llama-3-8b-instruct-EfficientQAT-w2g64-GPTQ" # Example quantized model

tokenizer = AutoTokenizer.from_pretrained(quant_dir, use_fast=True)

# Load the quantized model

model = GPTQModel.from_quantized(quant_dir)

# Perform inference

prompt = "Model quantization is"
input_ids = tokenizer(prompt, return_tensors="pt").to(model.device)
generated_ids = model.generate(**input_ids)
print(tokenizer.decode(generated_ids[0]))

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Fetching 8 files: 100%
8/8 [00:00<00:00, 4.57it/s]
.gitattributes: 100%
1.52k/1.52k [00:00<00:00, 58.6kB/s]
README.md: 100%
8.72k/8.72k [00:00<00:00, 167kB/s]
quantize_config.json: 100%
466/466 [00:00<00:00, 14.0kB/s]
INFO - Ignoring unknown parameter in the quantization configuration: model_name_or_path.
INFO - Ignoring unknown parameter in the quantization configuration: model_file_base_name.
INFO - Estimated Quantization BPW (bits per weight): 2.34375 bpw, based on [bits: 2, group_size: 64]
INFO - Auto enabling flash attention2
INFO - Auto pick kernel based on compatibility: <class 'gptqmodel.nn_modules.qlinear.dynamic_cuda.DynamicCudaQuantLinear'>
INFO - make_quant: Linear candidates: [<class 'gptqmodel.nn_modules.qlinear.dynamic_cuda.DynamicCudaQuantLinear'>, <class 'gptqmodel.nn_modules.qlinear.torch.TorchQuantLinear'>]
INFO - make_quant: Selected linear: <class 'gptqmodel.nn_modules.qlinear.dynamic_cuda.DynamicCudaQuantLinear'>.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
/usr/local/lib/python3.11/dist-packages/transformers/generation/utils.py:1141: UserWarning: Using the model-agnostic default max_length (=20) to control the generation length. We recommend setting max_new_tokens to control the maximum length of the generation.
warnings.warn(

RuntimeError Traceback (most recent call last)
in <cell line: 0>()
13 prompt = "Model quantization is"
14 input_ids = tokenizer(prompt, return_tensors="pt").to(model.device)
---> 15 generated_ids = model.generate(**input_ids)
16 print(tokenizer.decode(generated_ids[0]))

18 frames
/usr/local/lib/python3.11/dist-packages/gptqmodel/nn_modules/qlinear/dynamic_cuda.py in forward(self, x)
129 )
130
--> 131 out = out.to(x.dtype).reshape(out_shape)
132 if self.bias is not None:
133 out.add_(self.bias)

RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
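This looks like a kernel/architecture mismatch rather than a model problem: the DynamicCudaQuantLinear kernel selected here apparently was not built for this GPU, while the earlier pipeline run fell back to the pure-Torch kernel and did run (albeit with garbage output). A quick way to confirm which GPU and compute capability the Colab session has:

import torch

# Print the GPU model and its CUDA compute capability (a Colab T4 reports (7, 5)).
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))

If your installed gptqmodel version exposes a backend/kernel selection argument on from_quantized, forcing the Torch kernel (TorchQuantLinear, the one the pipeline run picked automatically) should at least avoid this error.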

from transformers import AutoTokenizer
from gptqmodel import GPTQModel

quant_dir = "ChenMnZ/Llama-3-8b-instruct-EfficientQAT-w2g64-GPTQ"

quant_dir = "ChenMnZ/Llama-2-7b-EfficientQAT-w2g128-BitBLAS"

or local path

tokenizer = AutoTokenizer.from_pretrained(quant_dir, use_fast=True)

# load quantized model to the first GPU

model = GPTQModel.from_quantized(quant_dir)

# inference with model.generate

print(tokenizer.decode(model.generate(**tokenizer("Model quantization is", return_tensors="pt").to(model.device))[0]))

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Fetching 8 files: 100%
8/8 [00:00<00:00, 395.16it/s]
INFO - Ignoring unknown parameter in the quantization configuration: model_name_or_path.
INFO - Ignoring unknown parameter in the quantization configuration: model_file_base_name.
INFO - Estimated Quantization BPW (bits per weight): 2.34375 bpw, based on [bits: 2, group_size: 64]
INFO - Auto enabling flash attention2
INFO - make_quant: Linear candidates: [<class 'gptqmodel.nn_modules.qlinear.dynamic_cuda.DynamicCudaQuantLinear'>, <class 'gptqmodel.nn_modules.qlinear.torch.TorchQuantLinear'>]
INFO - make_quant: Selected linear: <class 'gptqmodel.nn_modules.qlinear.dynamic_cuda.DynamicCudaQuantLinear'>.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting pad_token_id to eos_token_id:128001 for open-end generation.

RuntimeError Traceback (most recent call last)
in <cell line: 0>()
13
14 # inference with model.generate
---> 15 print(tokenizer.decode(model.generate(**tokenizer("Model quantization is", return_tensors="pt").to(model.device))[0]))

18 frames
/usr/local/lib/python3.11/dist-packages/gptqmodel/nn_modules/qlinear/dynamic_cuda.py in forward(self, x)
129 )
130
--> 131 out = out.to(x.dtype).reshape(out_shape)
132 if self.bias is not None:
133 out.add_(self.bias)

RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

from transformers import AutoTokenizer, pipeline
import torch

# Load the tokenizer with add_prefix_space=False

tokenizer = AutoTokenizer.from_pretrained(
    "ChenMnZ/Llama-2-7b-EfficientQAT-w2g128-GPTQ",
    trust_remote_code=True,
    add_prefix_space=False  # add this line
)

# Create the pipeline with tuned generation settings

pipe = pipeline(
    "text-generation",
    model="ChenMnZ/Llama-2-7b-EfficientQAT-w2g128-GPTQ",
    tokenizer=tokenizer,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
    # extra settings to improve output quality
    max_length=256,       # total length of the generated text
    num_beams=4,          # number of beams for beam search
    early_stopping=True,  # stop generation early once the end of a sequence is reached
)

# Generate text using max_length

output = pipe("Who is AI?", do_sample=False)
print(output)

/usr/local/lib/python3.11/dist-packages/huggingface_hub/file_download.py:795: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(
/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning:
The secret HF_TOKEN does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
warnings.warn(
You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers
ENV: Auto setting PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' for memory saving.
ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for compatibililty.
INFO - Auto pick kernel based on compatibility: <class 'gptqmodel.nn_modules.qlinear.torch.TorchQuantLinear'>
/usr/local/lib/python3.11/dist-packages/transformers/modeling_utils.py:4371: FutureWarning: _is_quantized_training_enabled is going to be deprecated in transformers 4.39.0. Please use model.hf_quantizer.is_trainable instead
warnings.warn(
The cos_cached attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use the forward method of RoPE from now on instead. It is not used in the LlamaAttention class
The sin_cached attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use the forward method of RoPE from now on instead. It is not used in the LlamaAttention class
[{'generated_text': 'Who is AI?\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x02'}]

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "ChenMnZ/Llama-3-8b-EfficientQAT-w2g128-GPTQ"

# Load the model and tokenizer

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Format the input

prompt = "Give me a short introduction to large language model."
formatted_prompt = f"{tokenizer.bos_token}{prompt}{tokenizer.eos_token}"

# Tokenize and move to the model's device

model_inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)

# Generate a response

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id
)

# Decode the output

response = tokenizer.decode(generated_ids[0], skip_special_tokens=True)

print(response)

/usr/local/lib/python3.11/dist-packages/huggingface_hub/file_download.py:795: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(
/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning:
The secret HF_TOKEN does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
warnings.warn(
ENV: Auto setting PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' for memory saving.
ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for compatibililty.
INFO - Auto pick kernel based on compatibility: <class 'gptqmodel.nn_modules.qlinear.torch.TorchQuantLinear'>
/usr/local/lib/python3.11/dist-packages/transformers/modeling_utils.py:4371: FutureWarning: _is_quantized_training_enabled is going to be deprecated in transformers 4.39.0. Please use model.hf_quantizer.is_trainable instead
warnings.warn(
The cos_cached attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use the forward method of RoPE from now on instead. It is not used in the LlamaAttention class
The sin_cached attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use the forward method of RoPE from now on instead. It is not used in the LlamaAttention class
/usr/local/lib/python3.11/dist-packages/huggingface_hub/file_download.py:795: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set padding_side='left' when initializing the tokenizer.
Give me a short introduction to large language model.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
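One thing worth fixing independently of the kernel issue: wrapping the prompt as f"{bos_token}{prompt}{eos_token}" tells the model the text is already finished, which by itself encourages degenerate continuations. For the instruct checkpoint used in the first attempt, the usual approach is the tokenizer's chat template; a sketch, assuming the checkpoint's tokenizer ships one (stock Llama-3-Instruct tokenizers do):

messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
# Build the Llama-3 chat-format prompt instead of hand-placing bos/eos tokens.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
generated_ids = model.generate(input_ids, max_new_tokens=100, pad_token_id=tokenizer.eos_token_id)
# Decode only the newly generated tokens.
print(tokenizer.decode(generated_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))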
