Context Length and Max New Tokens
--max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
How do I increase this when using
output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
Same here. The Mistral 7B base model has an 8k context length, as I understand it. Is this model 4k, or is that a typo in the README.md?
Thanks!
Just change it to
--max-input-length 7892 --max-total-tokens 8192 --max-batch-prefill-tokens 8192
or whatever you want
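For context, those flags belong to Text Generation Inference's launcher, so a full serving command might look like the sketch below (the --quantize gptq flag is an assumption based on the GPTQ model discussed in this thread; adjust to your setup):

```shell
# Hedged sketch: serve the model with TGI and an 8k context window.
# --max-input-length caps the prompt; --max-total-tokens caps prompt + generation.
text-generation-launcher \
  --model-id TheBloke/zephyr-7B-beta-GPTQ \
  --quantize gptq \
  --max-input-length 7892 \
  --max-total-tokens 8192 \
  --max-batch-prefill-tokens 8192
```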
How do I do that here?
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline


def run():
    model_name_or_path = "TheBloke/zephyr-7B-beta-GPTQ"
    # To use a different branch, change revision
    # For example: revision="gptq-4bit-32g-actorder_True"
    model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                                 device_map="auto",
                                                 trust_remote_code=False,
                                                 revision="main")
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

    # Open the text file for reading
    with open('data.txt', 'r') as file:
        # Read the entire content of the file into a string
        file_content = file.read()

    prompt = f"Extract the useful information from the following text: {file_content} and convert the extracted data into a structured format using valid JSON only."
    prompt_template = f'''<|system|>
</s>
<|user|>
{prompt}</s>
<|assistant|>
'''

    print("\n\n*** Generate:")
    input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
    output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
    print(tokenizer.decode(output[0]))

    # Inference can also be done using transformers' pipeline
    print("*** Pipeline:")
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        top_k=40,
        repetition_penalty=1.1
    )
    print(pipe(prompt_template)[0]['generated_text'])


run()
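With plain transformers there is no server flag: the context window is a property of the model itself, and you control the generation side through max_new_tokens. What matters is that tokenized prompt + max_new_tokens stays within the model's window. A minimal sketch of that bookkeeping (clamp_new_tokens is a hypothetical helper, and the 8192 default assumes the Mistral 7B window; it is not read from the model config here):

```python
def clamp_new_tokens(prompt_len, requested_new, context_window=8192):
    """Clamp max_new_tokens so prompt + generation fits in the context window.

    prompt_len: number of tokens in the tokenized prompt (e.g. input_ids.shape[1]).
    requested_new: the max_new_tokens value you would like to use.
    """
    available = max(context_window - prompt_len, 0)
    return min(requested_new, available)


# With a 7892-token prompt, only 300 tokens of generation fit in 8192.
print(clamp_new_tokens(7892, 512))  # -> 300
# A short prompt leaves the requested budget untouched.
print(clamp_new_tokens(100, 512))   # -> 512
```

You would then pass the clamped value as max_new_tokens to model.generate (or the pipeline) instead of a fixed 512.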