CodeLlama 13B Instruct - GPTQ - TensorRT-LLM - RTX4090

Description

This repo contains TensorRT-LLM GPTQ model files for Meta's CodeLlama 13B Instruct built for a single RTX 4090 card and using tensorrt_llm version 0.15.0.dev2024101500. It's a 4-bit quantized version based on the main branch of the TheBloke CodeLlama 13B Instruct - GPTQ model.
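
As a rough back-of-the-envelope check of why 4-bit quantization helps on a single consumer card, the sketch below estimates the size of the weights alone. It assumes a nominal 13e9 parameters and ignores GPTQ group scales and zero points, the KV cache, and activations, so actual memory use will be higher. The helper name `weight_footprint_gib` is hypothetical.

```python
def weight_footprint_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


NUM_PARAMS = 13e9  # assumption: nominal parameter count of a "13B" model

fp16_gib = weight_footprint_gib(NUM_PARAMS, 16)  # unquantized float16 weights
int4_gib = weight_footprint_gib(NUM_PARAMS, 4)   # 4-bit quantized weights

print(f"float16 weights: ~{fp16_gib:.1f} GiB")
print(f"int4 weights:    ~{int4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the float16 footprint, which leaves room on the card for the KV cache needed by long sequences.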

TensorRT commands

To build this model, the following commands were run from the base folder of the TensorRT-LLM repository (see installation instructions in the repository for more information):

python examples/llama/convert_checkpoint.py \
    --model_dir ./CodeLlama-13b-Instruct-hf \
    --output_dir ./CodeLlama-13b-Instruct-hf_checkpoint \
    --dtype float16 \
    --quant_ckpt_path ./CodeLlama-13B-Instruct-GPTQ/model.safetensors \
    --use_weight_only \
    --weight_only_precision int4_gptq \
    --per_group

And then:

trtllm-build \
    --checkpoint_dir ./CodeLlama-13b-Instruct-hf_checkpoint \
    --output_dir ./CodeLlama-13B-Instruct-GPTQ_TensorRT \
    --gemm_plugin float16 \
    --max_input_len 8192 \
    --max_seq_len 8192

Prompt template: CodeLlama

[INST] <<SYS>>
Write code to solve the following coding problem that obeys the constraints and passes the example test cases. Please wrap your code answer using ```:
<</SYS>>

{prompt}
 [/INST] 
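
The template can be filled in with a small helper. This is a minimal sketch: `build_prompt` and `DEFAULT_SYSTEM_PROMPT` are hypothetical names, and the helper follows the string concatenation used in the Python example further down, which puts no line break between the user prompt and ` [/INST] `.

```python
FENCE = "`" * 3  # three backticks, built this way to keep this example readable

# System prompt text taken from the template above.
DEFAULT_SYSTEM_PROMPT = (
    "Write code to solve the following coding problem that obeys the "
    "constraints and passes the example test cases. "
    "Please wrap your code answer using " + FENCE + ":"
)


def build_prompt(user_prompt: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    """Wrap a user prompt in the CodeLlama instruct template."""
    return (
        "[INST] <<SYS>>\n"
        + system_prompt
        + "\n<</SYS>>\n\n"
        + user_prompt
        + " [/INST] "
    )


print(build_prompt("Write a function that reverses a string."))
```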

How to use this model from Python code

Using TensorRT-LLM API

Install the necessary packages

pip3 install tensorrt_llm==0.15.0.dev2024101500 -U --pre --extra-index-url https://pypi.nvidia.com

Note that this command should not be run from inside a virtual environment. If you need a virtual environment, run it twice: once outside the venv and then again inside it.
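
To see which situation applies before or after installing, a short check like the one below may help. It is a sketch using only the standard library: it detects environments created with `venv` (or a recent `virtualenv`) by comparing `sys.prefix` with `sys.base_prefix`, and it does not detect conda environments. The helper names are hypothetical.

```python
import sys
from importlib import metadata

EXPECTED_VERSION = "0.15.0.dev2024101500"  # version used to build this engine


def in_virtual_env() -> bool:
    """True when the interpreter runs inside a venv-style virtual environment."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def installed_version(package: str = "tensorrt_llm"):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


found = installed_version()
print(f"Inside a virtual environment: {in_virtual_env()}")
print(f"tensorrt_llm installed: {found} (expected {EXPECTED_VERSION})")
```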

Use the TensorRT-LLM API

from tensorrt_llm import LLM, SamplingParams

# Adjacent string literals inside parentheses are concatenated by Python.
system_prompt = (
    "[INST] <<SYS>>\n"
    "Write code to solve the following coding problem that obeys the constraints and passes the example test cases. Please wrap your code answer using ```:"
    "\n<</SYS>>\n\n"
)

user_prompt = (
    "<Your user prompt>"
    " [/INST] "
)

prompts = [
    system_prompt + user_prompt,
]
sampling_params = SamplingParams(max_tokens=512, temperature=1.31, top_p=0.14, top_k=49, repetition_penalty=1.17)

llm = LLM(model="./CodeLlama-13B-Instruct-GPTQ_TensorRT")

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
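
Because the system prompt asks the model to wrap its answer in triple backticks, the code can be pulled out of the generated text afterwards. The following is a minimal sketch with a hypothetical helper, `extract_code_blocks`. It assumes the model closes every fence it opens, so an answer cut off by `max_tokens` before the closing fence yields no match.

```python
import re
from typing import List

FENCE = "`" * 3  # three backticks

# Opening fence, optional language tag, newline, body (lazy), closing fence.
CODE_BLOCK_RE = re.compile(FENCE + r"[^\n`]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code_blocks(text: str) -> List[str]:
    """Return the contents of every fenced code block found in text."""
    return [body.strip("\n") for body in CODE_BLOCK_RE.findall(text)]


# Stand-in for output.outputs[0].text from the example above.
sample = "Here is a solution:\n" + FENCE + "python\nprint('hi')\n" + FENCE + "\nDone."
for block in extract_code_blocks(sample):
    print(block)
```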

Using Oobabooga's Text Generation WebUI

Follow the instructions described here: https://github.com/oobabooga/text-generation-webui/pull/5715

Use version 0.15.0.dev2024101500 of tensorrt_llm instead of 0.10.0.

